Skip to content

[gyb] Python 2 or 3 compatible Generate Your Boilerplate #806

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 7 commits into from
Dec 31, 2015

Conversation

RLovelett
Copy link
Contributor

EDIT: I've edited it since I first submitted this pull-request. I've added all the other patches I found necessary to get the gyb script to run on either Python 2 or 3 (as opposed to the single patch I submitted in the first place). As result this comment has been reformatted. My goal was to write fairly explicit commit messages that explain what each commit does on its own. They may be worth reviewing to figure out why each change was made.

@RLovelett RLovelett force-pushed the gyb-python-3 branch 2 times, most recently from d944e27 to 42604fa Compare December 28, 2015 20:57
@gribozavr gribozavr self-assigned this Dec 29, 2015
@RLovelett
Copy link
Contributor Author

I've gone ahead and put all of my Python 2/3 code on this branch. At this point I am able to compile, with these patches, using either Python 2.7.11 (OS X) or 3.5.1 (Arch Linux). 🎉

I've deleted my gyb-python-3-wip branch because it no longer is a work in progress. If you want me to split up these commits into multiple pulls just let me know.

@RLovelett RLovelett changed the title [gyb] Use partial function application to work around PEP 3114 [gyb] Python 2 or 3 compatible Generate Your Boilerplate Dec 30, 2015
@RLovelett RLovelett force-pushed the gyb-python-3 branch 4 times, most recently from 878f13e to b1e4585 Compare December 30, 2015 21:16
@RLovelett
Copy link
Contributor Author

Hopefully this should help solve some of the issues that were left by #83 and #90.

@@ -27,7 +27,7 @@ lineForName = {}

# Load the data file.
with open(CFDatabaseFile, 'rb') as f:
for line in f:
for line in [line_bytes.decode('utf-8') for line_bytes in f]:

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You should make this:

for line in (line_bytes.decode('utf-8') for line_bytes in f):

so that it utilizes generators instead of building up a list unnecessarily

https://www.python.org/dev/peps/pep-0289/

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That was a good suggestion. I've never written Python before and I hadn't noticed the difference between the parenthesis and square brackets.

Though in the end I actually found there is a codecs module that does what I am looking for explicitly and I've refactored the code to make use of that instead.

@RLovelett RLovelett force-pushed the gyb-python-3 branch 3 times, most recently from 8a96152 to 464426a Compare December 31, 2015 17:31
@@ -304,7 +309,7 @@ def splitGybLines(sourceLines):
dedents = 0
try:
for tokenKind, tokenText, tokenStart, (tokenEndLine, tokenEndCol), lineText \
in tokenize.generate_tokens(sourceLines.__iter__().next):
in tokenize.generate_tokens(partial(next, sourceLines.__iter__())):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The uses of partial don’t read very well. Would we be better off with a lambda?

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about:

sourceLinesIter = iter(sourceLines)
try:
    for tokenKind, tokenText, tokenStart, (tokenEndLine, tokenEndCol), lineText \
        in tokenize.generate_tokens(next(sourceLinesIter)):

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That would be fine except the result of next(sourceLinesIter) is not callable. Taken from tokenize.generate_tokens docs

... requires one argument, readline, which must be a callable object which provides the same interface as the readline() method of built-in file objects (see section File Objects) ...

Notice that in the original code next was sent to tokenize.generate_tokens not next(), i.e., something that is callable. @frewsxcv suggestion is the equivalent of sending next() and not next. Hence why I went with the partial function application.

Testing

I first applied this patch:

diff --git a/utils/gyb.py b/utils/gyb.py
index ab422f1..68a8280 100755
--- a/utils/gyb.py
+++ b/utils/gyb.py
@@ -307,9 +307,10 @@ def splitGybLines(sourceLines):
     unmatchedIndents = []

     dedents = 0
+    sourceLinesIter = iter(sourceLines)
     try:
         for tokenKind, tokenText, tokenStart, (tokenEndLine, tokenEndCol), lineText \
-            in tokenize.generate_tokens(partial(next, sourceLines.__iter__())):
+            in tokenize.generate_tokens(next(sourceLinesIter)):

             if tokenKind in (tokenize.COMMENT, tokenize.ENDMARKER):
                 continue

Then I ran this from the utils directory.

$ python3
>>> from gyb import *
>>> src = splitLines('''\
... if x:
...     print x
... if y: # trailing comment
...     print z
...     if z: # another comment\
... ''')
>>> splitGybLines(src)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/Users/lovelett/Source/swift/utils/gyb.py", line 313, in splitGybLines
  in tokenize.generate_tokens(next(sourceLinesIter)):
File "/usr/local/Cellar/python3/3.5.1/Frameworks/Python.framework/Versions/3.5/lib/python3.5/tokenize.py", line 514, in _tokenize
  line = readline()
TypeError: 'str' object is not callable

Expected:

$ python3
>>> from gyb import *
>>> src = splitLines('''\
... if x:
...     print x
... if y: # trailing comment
...     print z
...     if z: # another comment\
... ''')
>>> splitGybLines(src)
[3, 5]

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Blah, you're right. Forgot that generate_tokens wants a callable. You might be able to get by with:

tokenize.generate_tokens(sourceLinesIter.next)

...since next is a method on an iterator, though I'm not entirely sure. Otherwise, I guess just stick with a lambda or partial

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@frewsxcv unfortunately next only exists on an iterator in Python 2. PEP 3114 renamed the next function to __next__ in Python 3 and got rid of next entirely. Or put another way, it works as expected in Python 2 and fails with the runtime error AttributeError: 'list_iterator' object has no attribute 'next' if you run the same code in Python 3.

Like I said before somewhere else, this is the first Python code I've ever written or worked with. I say that to say please forgive my ignorance when asking what may be a stupid question: what is a lambda? Is that some sort of Python keyword?

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lambda is a keyword to create an anonymous function in Python. Here's a short demo:

>>> some_string = "hello!"
>>> some_string
'hello!'
>>> some_function = lambda: some_string
>>> some_function()
'hello!'
>>> another_function = lambda s: some_string + s
>>> another_function(" world!")
'hello! world!'

So in this particular case, you'd probably want to do something like this:

tokenize.generate_tokens(lambda: next(sourceLinesIter))

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've gone ahead and switched over to the lambda syntax (see: dc70d6d06119b72f3733da56fccb3c2d66082d8a). As a newbie to Python this feels like a lateral move but I defer to those who have been there before me.

@frewsxcv thanks for showing me some Python! 🍻

@RLovelett RLovelett force-pushed the gyb-python-3 branch 3 times, most recently from 7c43fdf to 8f22300 Compare December 31, 2015 21:51
[PEP 3114](https://www.python.org/dev/peps/pep-3114/) renamed
`iterator.next()` (Python 2) to `iterator.__next__()` (Python 3). The
recommended solution to make the code work in both Python 2 and 3 is to
call the global `next` function.

To use this recommended global `next` function this patch uses a lambda
function to supply `tokenize.generate_tokens` a callable function as it was previously.

This should be functionally equivalent to the old code with the added
benefit of working on both Python 2 and 3.
The `StringIO` and `cStringIO` modules are gone in Python 3. The
recommendation is to import the `io` module on Python 3. Therefore, this
patch first attempts to import the Python 2 module, `cStringIO`, and if
that fails then attempts to import the Python 3 module, `io`.

**NOTE**: There are still other Python 3.x fixes necessary to make `gyb`
run on a Python 3.x interpreter. This is just one small incremental
patch on the way there.
In Python 2 the syntax to catch exceptions was:

except (Exception1, Exception2), target:

In Python 3 the syntax to catch exceptions is:

except (Exception1, Exception2) as target:

This newer Python 3 syntax is also available in 2.6 and 2.7. Therefore,
it is preferred for compatibility reasons. Additionally, the target no
longer can be a tuple. This patch refactors the exception handeling code
to be the newer Python 3 exception catch syntax.
All strings are sequences of Unicode characters in Python 3. This is
entirely different than that of Python 2. Python 2's strings were of
bytes. However, Python 2 does have the concept of Unicode strings. This
patch changes the behavior of the file reader to use the same the codecs
module on Python 2 to properly read a string into a unicode string. From
there the strings are meant to be equivalent on 2 and 3. The rest of the
patch just updates the code to natively work with unicode strings.

To test the class `GraphemeClusterBreakPropertyTable`:

    $ python2 utils/gyb --test \
    -DunicodeGraphemeBreakPropertyFile=./utils/UnicodeData/GraphemeBreakProperty.txt \
    -DunicodeGraphemeBreakTestFile=./utils/UnicodeData/GraphemeBreakTest.txt \
    -DCMAKE_SIZEOF_VOID_P=8 \
    -o /tmp/UnicodeExtendedGraphemeClusters.cpp.2.7.tmp \
    ./stdlib/public/stubs/UnicodeExtendedGraphemeClusters.cpp.gyb

    $ python3 utils/gyb --test \
    -DunicodeGraphemeBreakPropertyFile=./utils/UnicodeData/GraphemeBreakProperty.txt \
    -DunicodeGraphemeBreakTestFile=./utils/UnicodeData/GraphemeBreakTest.txt \
    -DCMAKE_SIZEOF_VOID_P=8 \
    -o /tmp/UnicodeExtendedGraphemeClusters.cpp.3.5.tmp \
    ./stdlib/public/stubs/UnicodeExtendedGraphemeClusters.cpp.gyb

    $ diff -u /tmp/UnicodeExtendedGraphemeClusters.cpp.2.7.tmp \
    /tmp/UnicodeExtendedGraphemeClusters.cpp.3.5.tmp

To test the method `get_grapheme_cluster_break_tests_as_UTF8`:

    $ python2 utils/gyb --test \
    -DunicodeGraphemeBreakPropertyFile=./utils/UnicodeData/GraphemeBreakProperty.txt \
    -DunicodeGraphemeBreakTestFile=./utils/UnicodeData/GraphemeBreakTest.txt \
    -DCMAKE_SIZEOF_VOID_P=8 \
    -o /tmp/UnicodeGraphemeBreakTest.cpp.2.7.tmp \
    ./unittests/Basic/UnicodeGraphemeBreakTest.cpp.gyb

    $ python3 utils/gyb --test \
    -DunicodeGraphemeBreakPropertyFile=./utils/UnicodeData/GraphemeBreakProperty.txt \
    -DunicodeGraphemeBreakTestFile=./utils/UnicodeData/GraphemeBreakTest.txt \
    -DCMAKE_SIZEOF_VOID_P=8 \
    -o /tmp/UnicodeGraphemeBreakTest.cpp.3.5.tmp \
    ./unittests/Basic/UnicodeGraphemeBreakTest.cpp.gyb

    $ diff -u /tmp/UnicodeGraphemeBreakTest.cpp.2.7.tmp \
    /tmp/UnicodeGraphemeBreakTest.cpp.3.5.tmp
PEP 3106 [1] changed the behavior of the dictionaries `items` method.
In Python 2, `items` builds a real list of tuples where `iteritems`
returns a generator. PEP 3106 changes Python 3's `items` method to be
equivalent to Python 2's `iteritems` and completely removes `iteritems`
in Python 3.

This patch switches to both to use `items`. This could have a negative
impact on Python 2's performance because it now causes the dictionary
tuples to be built in memory.

[1] https://www.python.org/dev/peps/pep-3106/
Python 2's map function [1] returns a list by default. Compared with
Python 3's map function [2] which returns an iterator (or map object).
The former is subscriptable, while the latter is not.

This patch explicitly converts the result of some map operations to be
a list. That way they have the same intended behaviour on both Python 2
and 3.

[1] https://docs.python.org/2/library/functions.html#map
[2] https://docs.python.org/3/library/functions.html#map
The Popen command on Python returns a byte sequence on stdout by
default. However by sending the constructor the argument
`universal_newlines=True` it forces the Popen to put a string on stdout.

This was not a problem on Python 2 because the Python 2 regex engine
seemed to work find on byte sequences where Python 3's does not. By
explicitly converting everything to a string the same behavior is now
seen on Python 2 and 3.

See: https://docs.python.org/2/library/subprocess.html#frequently-used-arguments
See: https://docs.python.org/3/library/subprocess.html#frequently-used-arguments
@RLovelett
Copy link
Contributor Author

@dabrahams I've updated (7700b3b) per your comments.

@dabrahams
Copy link
Contributor

@RLovelett Thanks!

dabrahams pushed a commit that referenced this pull request Dec 31, 2015
[gyb] Python 2 or 3 compatible Generate Your Boilerplate
@dabrahams dabrahams merged commit b50b541 into swiftlang:master Dec 31, 2015
@RLovelett RLovelett deleted the gyb-python-3 branch December 31, 2015 22:07
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants