re functions str/unicode problems #273

Closed
Dakkaron opened this issue Jun 7, 2016 · 8 comments

Comments

@Dakkaron
Contributor

Dakkaron commented Jun 7, 2016

Currently all functions in re are typed with AnyStr; for example, re.match is defined as follows:

@overload
def match(pattern: AnyStr, string: AnyStr, flags: int = ...) -> Match[AnyStr]: ...

This is a problem because AnyStr is defined as

AnyStr = TypeVar('AnyStr', bytes, str)

As a result, you cannot use, for example, a str pattern on a unicode string, which works fine in Python 2.7.

How should this be changed so that typeshed reflects reality?

The same problem appears in a lot of places, since AnyStr is used as if it were Union[str, unicode].
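To illustrate the difference, here is a minimal sketch. It is written for Python 3, with bytes/str standing in for Python 2's str/unicode, and the names SameStr, same_kind and mixed_kind are made up for this example (they are not part of typeshed). A type checker should reject a mixed call to same_kind but accept one to mixed_kind; at runtime both just return their second argument.

```python
from typing import TypeVar, Union

# Hypothetical stand-in for AnyStr: a TypeVar constrained to two string types.
SameStr = TypeVar("SameStr", bytes, str)


def same_kind(pattern: SameStr, string: SameStr) -> SameStr:
    """Both arguments must resolve to the same constraint (bytes or str)."""
    return string


def mixed_kind(pattern: Union[bytes, str], string: SameStr) -> SameStr:
    """The pattern may be either type; the result follows `string` only."""
    return string


# Fine for a type checker: both arguments are str.
same_kind("a", "b")

# Fine for a type checker only with the Union-typed pattern.
mixed_kind(b"a", "b")
```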

@gvanrossum
Member

I'm of two minds; it would be better not to rely on this, because your code will break in Python 3; but then this behavior seems to be relied on commonly. I propose to wait until we've truly addressed python/typing#208.

@Dakkaron
Contributor Author

Dakkaron commented Jun 7, 2016

I see what you mean.

For this specific case I would recommend using something like this:

def match(pattern: Union[str, unicode], string: AnyStr, flags: int = ...) -> Match[AnyStr]: ...

The reason is that the return type depends only on string: if string is str, the return value will always be Match[str], no matter what type pattern is, and if string is unicode, the return value will always be Match[unicode], no matter what type pattern is.

>>> re.match('(a)','a').groups()
('a',)
>>> re.match('(a)',u'a').groups()
(u'a',)
>>> re.match(u'(a)','a').groups()
('a',)
>>> re.match(u'(a)',u'a').groups()
(u'a',)
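For readers on Python 3, where re raises TypeError when the pattern and string types differ, the same "result follows string" behaviour can be approximated with a small wrapper. This is only a sketch under stated assumptions: match_like_py2 is a hypothetical helper (not part of re or typeshed), and it assumes the pattern is pure ASCII so that converting it between bytes and str is lossless.

```python
from __future__ import annotations

import re
from typing import TypeVar, Union

S = TypeVar("S", bytes, str)


def match_like_py2(
    pattern: Union[bytes, str], string: S, flags: int = 0
) -> re.Match[S] | None:
    """Hypothetical helper: coerce `pattern` to the type of `string`.

    Assumes an ASCII-only pattern; non-ASCII input raises a Unicode error.
    """
    if isinstance(string, str) and isinstance(pattern, bytes):
        pattern = pattern.decode("ascii")
    elif isinstance(string, bytes) and isinstance(pattern, str):
        pattern = pattern.encode("ascii")
    # After coercion both arguments have the same type, as re requires.
    return re.match(pattern, string, flags)


text_groups = match_like_py2(b"(a)", "a").groups()
bytes_groups = match_like_py2("(a)", b"a").groups()
```

In both calls the type of the groups matches the type of string, not the type of pattern, which is what the proposed signature expresses.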

@gvanrossum
Member

That's a good idea. Could you submit a patch?

@ddfisher
Contributor

ddfisher commented Jun 7, 2016

We recently landed and then reverted a patch that did this (#244), partly because there was a mistake in the way the Unions of Patterns were written, but also because we decided to wait for python/typing#208. From what I've seen so far, the usual case for a string pattern/unicode match is when the string is just a static string literal; in that case, the best fix is to just make it a unicode literal. This is also a case which is likely to be allowed in python/typing#208 without needing changes here.

Also important to note: in general, testing with just ASCII is not a guarantee of compatibility. In this case, using non-ASCII codepoints while mixing bytes and unicode will not cause a runtime exception, but will likely result in missing matches you might otherwise expect.
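A rough way to see the missing-match effect on Python 3 is to simulate it. The sketch below assumes the Python 2 behaviour is equivalent to reading a UTF-8 encoded byte pattern one byte per character (as latin-1 would); that equivalence is my assumption for illustration, and the variable names are made up.

```python
import re

text = "caf\u00e9"  # "café" as text; the last character is U+00E9

# The same character encoded as UTF-8 is two bytes: 0xC3 0xA9.
utf8_pattern = "\u00e9".encode("utf-8")

# Reading those bytes one per character yields a two-character pattern,
# which is no longer the single character U+00E9.
pattern_as_code_points = utf8_pattern.decode("latin-1")

# No exception is raised, but the expected match is silently missed.
no_match = re.search(pattern_as_code_points, text)

# A text pattern finds the character as expected.
real_match = re.search("\u00e9", text)
```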

I don't think I'm against this patch; just wanted to give some context.

@ddfisher
Contributor

ddfisher commented Jun 7, 2016

Actually, I think we need this because of zulip/zulip#936 (you can't write raw, explicit unicode literals in Python 3).

@Dakkaron
Contributor Author

Dakkaron commented Jun 8, 2016

I see what you mean.

I am just thinking: there have been no new posts at python/typing#208 and python/mypy#1141 since April, so I figure a solution to that problem will take a while (which is understandable, since it is really not a trivial problem to sort out). Should we maybe implement a patch like this for the time being, until there is a resolution on this matter?

In our main project we have data from different sources coming in as either unicode or str, and we want to use the same (precompiled) regexes on both. If we define the regex as unicode, mypy will think that the output from e.g. re.sub(u'x','y','x') is of type unicode, whereas it is actually of type str, which could cause problems afterwards. Also, stuff like this fails: x = re.compile(u"x").sub("y","x") # type: str
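As a point of comparison, on Python 3 (where a compiled pattern only accepts its own string type) the same use case would need something like the sketch below. DualPattern is a hypothetical class made up for this example, and it assumes an ASCII-only pattern so the bytes and text versions are equivalent.

```python
import re
from typing import TypeVar

S = TypeVar("S", bytes, str)


class DualPattern:
    """Hypothetical wrapper: one regex usable on both bytes and str input."""

    def __init__(self, pattern: str) -> None:
        # Compile the pattern once per string type (assumes ASCII-only).
        self._text = re.compile(pattern)
        self._bytes = re.compile(pattern.encode("ascii"))

    def sub(self, repl: S, string: S) -> S:
        # Pick the compiled pattern by the type of the input string,
        # so the result type always follows `string`.
        compiled = self._text if isinstance(string, str) else self._bytes
        return compiled.sub(repl, string)


x_pattern = DualPattern("x")
text_result = x_pattern.sub("y", "x")
bytes_result = x_pattern.sub(b"y", b"x")
```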

@gvanrossum
Member

gvanrossum commented Jun 8, 2016 via email

@JelleZijlstra
Member

In Python 2, all re.pyi functions now accept both str and unicode for their pattern arguments, so I don't think there's anything else to do here.
