re functions str/unicode problems #273

Dakkaron · 2016-06-07T16:16:30Z

Currently, all functions in re are typed with AnyStr, so for example re.match is defined as follows:

@overload
def match(pattern: AnyStr, string: AnyStr, flags: int = ...) -> Match[AnyStr]: ...

This is a problem, because AnyStr is defined as

AnyStr = TypeVar('AnyStr', bytes, str)

This means you cannot use, for example, a str pattern on a unicode string, even though that works fine in Python 2.7.

How should this be changed so that typeshed reflects reality?

The same problem appears in a lot of places, since AnyStr is often used as if it were Union[str, unicode].

gvanrossum · 2016-06-07T16:25:38Z

I'm of two minds; it would be better not to rely on this, because your code will break in Python 3; but then this behavior seems to be relied on commonly. I propose to wait until we've truly addressed python/typing#208.
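To illustrate the point about Python 3, here is a minimal sketch that can be run under Python 3. The helper name `mixing_raises` is made up for this example and is not part of re or typeshed. It shows that same-type calls succeed, while mixing a text pattern with bytes (or the reverse) raises TypeError instead of being silently accepted as in Python 2.7.

```python
import re

# Same-type calls work for both text and bytes.
assert re.match("(a)", "a").groups() == ("a",)
assert re.match(b"(a)", b"a").groups() == (b"a",)


def mixing_raises(pattern, string) -> bool:
    """Return True if re.match rejects this pattern/string combination.

    Hypothetical helper, used only to demonstrate the runtime behaviour.
    """
    try:
        re.match(pattern, string)
    except TypeError:
        return True
    return False


print(mixing_raises("(a)", b"a"))  # text pattern, bytes subject
print(mixing_raises(b"(a)", "a"))  # bytes pattern, text subject
```

Both calls report that the mixed combination is rejected, which is why code relying on the Python 2.7 behaviour needs changes when ported.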

Dakkaron · 2016-06-07T16:49:29Z

I see what you mean.

For this specific case I would recommend using something like this:

def match(pattern: Union[str, unicode], string: AnyStr, flags: int = ...) -> Match[AnyStr]: ...

The reason is that if string is a str, the return value is always Match[str], no matter what type pattern is. If string is unicode, the return value is always Match[unicode], again regardless of the type of pattern.

>>> re.match('(a)','a').groups()
('a',)
>>> re.match('(a)',u'a').groups()
(u'a',)
>>> re.match(u'(a)','a').groups()
('a',)
>>> re.match(u'(a)',u'a').groups()
(u'a',)
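The typing idea in the proposed signature (the result type follows string, not pattern) can be sketched in Python 3 terms. This is only an illustration under stated assumptions: `match_any` is a hypothetical wrapper, not a function in re or typeshed, and it coerces the pattern via ASCII because Python 3 does not allow mixing at runtime. A static type checker may still want casts inside the body; the sketch is about the shape of the signature.

```python
import re
from typing import AnyStr, Optional, Union


def match_any(
    pattern: Union[str, bytes], string: AnyStr, flags: int = 0
) -> "Optional[re.Match[AnyStr]]":
    """Match pattern against string; the result type follows string.

    Hypothetical helper: the pattern is converted to the type of string.
    Non-ASCII patterns raise a UnicodeError rather than being guessed at.
    """
    if isinstance(string, str) and isinstance(pattern, bytes):
        pattern = pattern.decode("ascii")
    elif isinstance(string, bytes) and isinstance(pattern, str):
        pattern = pattern.encode("ascii")
    return re.match(pattern, string, flags)


# The four combinations from the session above, with bytes standing in
# for Python 2 str and str standing in for Python 2 unicode.
print(match_any(b"(a)", b"a").groups())
print(match_any(b"(a)", "a").groups())
print(match_any("(a)", b"a").groups())
print(match_any("(a)", "a").groups())
```

As in the Python 2.7 session, the type of the captured groups depends only on the second argument.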

gvanrossum · 2016-06-07T16:52:39Z

That's a good idea. Could you submit a patch?

ddfisher · 2016-06-07T18:28:14Z

We just recently landed and reverted a patch that did this, #244. Partly because there was a mistake in the way the Unions of Patterns were written, but also because we decided to wait for python/typing#208. From what I've seen so far, the usual case for a string pattern/unicode match is when the string is just a static string literal -- in that case, the best fix is to just make it a unicode literal. This is also a case which is likely to be allowed in python/typing#208 without needing changes here.

Also important to note: in general, testing with just ASCII is not a guarantee of compatibility. In this case, using non-ASCII codepoints while mixing bytes and unicode may not cause a runtime exception, but will likely result in missing matches you might otherwise expect.
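A rough Python 3 analogue of the non-ASCII caveat, offered as a sketch rather than a reproduction of the Python 2 behaviour: the same pattern can match a non-ASCII character as text and fail to match its UTF-8 encoding as bytes, because a bytes pattern works on individual bytes and its `\w` only covers ASCII. The variable names are chosen for this example.

```python
import re

text = "é"                   # one code point
data = text.encode("utf-8")  # two bytes: b'\xc3\xa9'

# \w in a text pattern is Unicode-aware; in a bytes pattern it is ASCII-only.
text_word = re.match(r"\w", text)
bytes_word = re.match(rb"\w", data)

# "." matches one code point in text, but only one byte in bytes.
text_dot = re.fullmatch(r".", text)
bytes_dot = re.fullmatch(rb".", data)

print(text_word is not None)   # True
print(bytes_word is not None)  # False
print(text_dot is not None)    # True
print(bytes_dot is not None)   # False
```

An ASCII-only test such as matching "a" would pass in all four cases and hide the difference.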

I don't think I'm against this patch; just wanted to give some context.

ddfisher · 2016-06-07T19:08:43Z

Actually, I think we need this because of zulip/zulip#936 (you can't write raw, explicit unicode literals in Python 3).

Dakkaron · 2016-06-08T08:16:10Z

I see what you mean.

There have been no new posts at python/typing#208 and python/mypy#1141 since April, so I figure a solution to that problem will take a while (which is understandable, since it is really not a trivial problem to sort out). Should we maybe implement a patch like this for the time being, until there is a resolution on this matter?

In our main project we have data from different sources coming in as either unicode or str, and we want to use the same (precompiled) regexes on both. If we define the regex as unicode, mypy will think that the output from e.g. re.sub(u'x','y','x') is of type unicode, whereas it is actually of type str, which could cause problems afterwards. Code like this also fails to type-check: x = re.compile(u"x").sub("y","x") # type: str
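For readers facing the same use case on Python 3, where one compiled pattern cannot serve both types, one possible approach is to compile the pattern twice and dispatch on the subject type. This is a sketch under assumptions: `DualPattern` is a hypothetical class, it only handles ASCII patterns and replacements, and flags that are invalid for bytes patterns (such as re.UNICODE) are not handled.

```python
import re
from typing import AnyStr


class DualPattern:
    """Hypothetical wrapper holding a text and a bytes version of one regex."""

    def __init__(self, pattern: str, flags: int = 0) -> None:
        # Compile once per type so both can be reused across calls.
        self._text = re.compile(pattern, flags)
        self._bytes = re.compile(pattern.encode("ascii"), flags)

    def sub(self, repl: str, string: AnyStr) -> AnyStr:
        # The result has the same type as the subject string.
        if isinstance(string, bytes):
            return self._bytes.sub(repl.encode("ascii"), string)
        return self._text.sub(repl, string)


x_to_y = DualPattern("x")
print(x_to_y.sub("y", "x"))   # text in, text out
print(x_to_y.sub("y", b"x"))  # bytes in, bytes out
```

The signature of `sub` expresses the property asked for in this thread: the return type is tied to the subject string, not to how the pattern was written.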

gvanrossum · 2016-06-08T20:53:44Z

Yeah, a PR that makes re.sub()'s first arg Union[str, unicode] would be fine.

JelleZijlstra · 2018-04-04T23:57:48Z

In Python 2, all re.pyi functions now accept both str and unicode for their pattern arguments, so I don't think there's anything else to do here.

Dakkaron mentioned this issue Jun 9, 2016

re methods' pattern-parameters don't affect the return value anymore #281

Merged

gvanrossum added the bytes-unicode label Aug 5, 2016

gvanrossum added the help wanted label Mar 29, 2017

posita mentioned this issue Apr 19, 2017

re match objects fail validation when using string as argument to group python/mypy#3199

Closed

JelleZijlstra mentioned this issue Apr 23, 2017

matching bytes regular expression against unicode #1133

Closed

gvanrossum removed the help wanted label Apr 30, 2017

JelleZijlstra closed this as completed Apr 4, 2018
