Import bbayles/what-url #1

Merged
merged 12 commits into from
May 18, 2023

bbayles
Collaborator

@bbayles bbayles commented May 16, 2023

Re: ada-url/ada#408, this PR brings in the code from bbayles/what-url.

We can discuss what needs to be added, subtracted, or changed. Here was my to-do list:

  • Build Python wheels for Windows
  • Make the Mac OS wheels portable (with delocate)
  • Make the Linux wheels more compatible (i.e. build them on a machine with an older toolchain)
  • Add benchmarks
  • Rename the package from what_url to ada_url

I came up with the list of exposed functions based on what I thought would be useful/familiar to Python users, but we can add/subtract/change as needed.

@anonrig anonrig requested a review from lemire May 16, 2023 21:31
@lemire
Member

lemire commented May 16, 2023

@TkTech @Ezibenroc Could you have a look? This aims to wrap the Ada URL parsing library... https://github.com/ada-url/ada

Member

@lemire lemire left a comment

This is not what I would consider to be a wrapper around Ada per se. I think it is best described as @bbayles does... a URL joining library. A wrapper would proceed as the Rust and Go wrappers do, and offer a WHATWG interface...

https://github.com/ada-url/goada

https://github.com/ada-url/rust

This is not what this Python code does.

It is fine for what it does, but I think we will need a wrapper and as such we should probably reserve the name ada_url.

Maybe this python package should be called ada_url_join?

And then we will want later to produce ada_url which would follow the Go, Rust, JavaScript/Node model and offer setters/getters.

@bbayles
Collaborator Author

bbayles commented May 17, 2023

What about the idea of having a low level module that exposes the ADA C functions more or less as-is, and a higher level module that has functions like the ones I have already?

I slightly object to the idea that this is just for URL joining as-is, because the parse_url and replace_url functions are there to extract and modify URL components.

@bbayles
Collaborator Author

bbayles commented May 17, 2023

I went ahead and added a URL class that should more faithfully match the one described by the WHATWG spec. Check it out!

@lemire
Member

lemire commented May 18, 2023

I like the new URL class a lot. Could __str__() return something useful? Can __dir__() return the attributes? Regarding the latter, my editor seems to use __dir__() for autocompletion so it is somewhat important. With the new instructions, I get clean builds and everything, which is fantastic.
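
One way the __str__() and __dir__() requests could be met is sketched below. It assumes the class forwards attribute access to an underlying parsed object through __getattr__; UrlSketch, _FIELDS, and the dict standing in for the parsed C object are hypothetical names for illustration, not the actual ada_url code.

```python
class UrlSketch:
    """Hypothetical stand-in for a URL class that proxies attribute access."""

    # A subset of the attribute names defined by the WHATWG URL interface.
    _FIELDS = ('href', 'protocol', 'username', 'password', 'host',
               'hostname', 'port', 'pathname', 'search', 'hash')

    def __init__(self, **parts):
        # A plain dict stands in for the parsed C object (assumption).
        self._parts = {name: parts.get(name, '') for name in self._FIELDS}

    def __getattr__(self, name):
        # Only called when normal lookup fails, so it covers the proxied fields.
        try:
            return self.__dict__['_parts'][name]
        except KeyError:
            raise AttributeError(name) from None

    def __dir__(self):
        # Advertise the proxied fields so dir() and editor completion see them.
        return sorted(set(super().__dir__()) | set(self._FIELDS))

    def __str__(self):
        # The serialized URL is a natural string form.
        return self.href

    def __repr__(self):
        return f'<UrlSketch "{self.href}">'


url = UrlSketch(href='https://tkte.ch/search?q=canada', hostname='tkte.ch')
print(str(url))
print('hostname' in dir(url))
```

Attributes that are only reachable through __getattr__ are invisible to dir() unless __dir__ lists them, which is probably why an editor would not offer them for completion.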

@anonrig I think we need to rename the repository. It could be anything, but python is an unfortunate choice. I think GitHub will let us rename (at least once).

Member

@anonrig anonrig left a comment

How do we publish to pip? Whoever does it, can you make sure @lemire and I have admin access?


steps:
- uses: actions/checkout@v3
- name: Set up Python ${{ matrix.python }}
Member

Can we install Ruff and test the linter as well in a different workflow?

Collaborator Author

I can split the workflow into different pieces, sure.

readme_dst = os.path.join(build_dir, 'README.pprst')
shutil.copyfile(readme_src, readme_dst)

project = 'ada-url/python'
Member

Let's not forget to change this when we rename the repo.

@bbayles
Collaborator Author

bbayles commented May 18, 2023

Re: publishing to PyPI, I am happy to do the deed and make people here admins for the repository as well. I currently maintain several packages there. Please let me know your PyPI usernames if you'd like me to do this.

@lemire
Member

lemire commented May 18, 2023

I think it is good if @bbayles does it.

@lemire
Member

lemire commented May 18, 2023

I think we need help and having someone who can handle the updates is great.

@bbayles My profile is here: https://pypi.org/user/lemire/

@lemire
Member

lemire commented May 18, 2023

@anonrig Do you want to change the name of the repo? Either now or after merging this?

Note that I think that this PR can be merged now. @anonrig: want to do it when you think it is ok?

@anonrig anonrig merged commit 32c92cc into ada-url:main May 18, 2023
@bbayles bbayles deleted the bbayles-what-url branch May 18, 2023 02:12
@bbayles
Collaborator Author

bbayles commented May 18, 2023

I will publish the PyPI package after we do the repository rename.

I will also publish docs to https://readthedocs.org/ - I'll make the same people admin for the project.

Could I be added as a contributor for this repo as well?

Thanks for accepting this PR - I'm happy to have a fast URL library for Python.

@anonrig
Member

anonrig commented May 18, 2023

I renamed it and invited you as a collaborator @bbayles. Thank you for your contribution.

@anonrig
Member

anonrig commented May 18, 2023

@bbayles I've created an account on PyPI. My username is anonrig.

@lemire
Member

lemire commented May 18, 2023

I had taken the liberty of inviting @bbayles to the org yesterday.

@TkTech

TkTech commented May 18, 2023

> @TkTech @Ezibenroc Could you have a look? This aims to wrap the Ada URL parsing library... https://github.com/ada-url/ada

@lemire I took a quick stab over breakfast and tossed up my attempt here: https://github.com/TkTech/can_ada. It's a very thin layer over the library and should require minimal changes with library updates. It releases binary wheels for x86, ARM, PowerPC, IBM Z, etc...

I believe this approach is simpler. Feel free to yoink anything to use in this repo if you find it useful.

@lemire
Member

lemire commented May 18, 2023

@TkTech @anonrig @bbayles

It looks like @TkTech's wrapper is faster... Can you guys check this out?

$ pip3 install can_ada
$ pip3 install ada_url
$ python3 -m timeit -s 'import can_ada' 'can_ada.parse("https://tkte.ch/search?q=canada")'
1000000 loops, best of 5: 386 nsec per loop
$ python3 -m timeit -s 'import ada_url' 'ada_url.URL("https://tkte.ch/search?q=canada")'
500000 loops, best of 5: 558 nsec per loop

I am also finding that it does a better job at exposing the attributes...

>>> import can_ada
>>> dir(can_ada.parse("https://tkte.ch/search?q=canada"))
['__class__', '__delattr__', '__dir__', '__doc__', '__eq__', '__format__', '__ge__', '__getattribute__', '__getstate__', '__gt__', '__hash__', '__init__', '__init_subclass__', '__le__', '__lt__', '__module__', '__ne__', '__new__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__sizeof__', '__str__', '__subclasshook__', 'has_credentials', 'has_empty_hostname', 'has_hash', 'has_hostname', 'has_non_empty_password', 'has_non_empty_username', 'has_password', 'has_port', 'has_search', 'has_valid_domain', 'hash', 'host', 'hostname', 'href', 'origin', 'password', 'pathname', 'pathname_length', 'port', 'protocol', 'search', 'to_diagram', 'username', 'validate']
>>> import ada_url
>>> dir(ada_url.URL("https://tkte.ch/search?q=canada"))
['__class__', '__delattr__', '__dict__', '__dir__', '__doc__', '__enter__', '__eq__', '__exit__', '__format__', '__ge__', '__getattr__', '__getattribute__', '__getstate__', '__gt__', '__hash__', '__init__', '__init_subclass__', '__le__', '__lt__', '__module__', '__ne__', '__new__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__sizeof__', '__str__', '__subclasshook__', '__weakref__', 'can_parse', 'close', 'urlobj']

@lemire
Member

lemire commented May 18, 2023

Ranked by performance... I have can_ada, the standard library (urllib) and then ada_url. I am finding that ada_url is 50% slower than can_ada.

$ python3 -m timeit -s 'import can_ada' 'can_ada.parse("https://tkte.ch/search?q=canada")'
1000000 loops, best of 5: 378 nsec per loop
$ python3 -m timeit -s 'import urllib' 'urllib.parse.urlparse("https://tkte.ch/search?q=canada")'
500000 loops, best of 5: 491 nsec per loop
$ python3 -m timeit -s 'import ada_url' 'ada_url.URL("https://tkte.ch/search?q=canada")'
500000 loops, best of 5: 575 nsec per loop
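
The same kind of measurement can be scripted with the timeit module instead of the command line. The sketch below only times the standard library so that it runs without extra packages; nsec_per_call is a hypothetical helper name, and passing can_ada.parse or ada_url.URL as the callable is assumed to work the same way.

```python
import timeit
from urllib.parse import urlparse

URL_TEXT = "https://tkte.ch/search?q=canada"


def nsec_per_call(func, arg, number=100_000, repeat=5):
    """Return the best per-call time in nanoseconds, like `python -m timeit`."""
    timer = timeit.Timer(lambda: func(arg))
    # Take the minimum over several runs to reduce noise from other processes.
    best = min(timer.repeat(repeat=repeat, number=number))
    return best / number * 1e9


print(f"urlparse: {nsec_per_call(urlparse, URL_TEXT):.0f} nsec per loop")
```

The lambda adds a small constant overhead to every call, so absolute numbers will be slightly higher than the command-line results. Also, urllib.parse keeps an internal cache of split results (see urllib.parse.clear_cache), so timing the same string repeatedly may flatter the standard library compared with a run over many distinct URLs.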

@TkTech

TkTech commented May 18, 2023

Keep in mind that the majority of the time on micro-benchmarks like this is spent in the pybind11 magic wrapper glue. It can be made quite a bit faster (often by an order of magnitude, but it varies) by ditching pybind11 and doing all the wrapper work yourself using the plain CPython API.

However, that will make the wrapper 1k+ lines, rather than under 100. For a quick proof of concept I did not want to spend that effort. A pybind11-based binding is a good balance of performance and ease of development.

@bbayles
Collaborator Author

bbayles commented May 18, 2023

I will fix the __dir__ listing shortly.

To see what's fast/slow we can look at profiling.

@bbayles
Collaborator Author

bbayles commented May 18, 2023

Also note that ada_url.URL is a more involved constructor than can_parse. A closer test would be:

import ada_url
ada_url.URL.can_parse("https://tkte.ch/search?q=canada")

@lemire
Member

lemire commented May 18, 2023

> A closer test would be:

Maybe there is a confusion. Try can_ada.parse("https://tkte.ch/search?q=canada"). It returns what is effectively a URL instance. The name can_ada has probably to do with the fact that @TkTech lives in Canada. It is not related to can_parse.

@lemire
Member

lemire commented May 18, 2023

> Keep in mind that the majority of the time on micro-benchmarks like this is spent in the pybind11 magic wrapper glue

I suspect that the C++ time is probably in the 150 ns to 200 ns range. So, probably, half the time is spent on magic wrapper glue. Beating urllib is great though.

@lemire
Member

lemire commented May 18, 2023

I think that can_ada.parse and ada_url.URL are functionally comparable. Both construct and expose a URL instance.

@TkTech

TkTech commented May 18, 2023

> The name can_ada has probably to do with the fact that @TkTech lives in Canada.

Yes, this is just me being cheeky, and naming projects is hard :) I actually live about a 30 minute walk from Lemire.

@bbayles
Collaborator Author

bbayles commented May 18, 2023

For what it's worth, on this dataset I get:

  • 0.2290 seconds to get through with ada_url.URL
  • 0.4666 seconds to get through with urllib.parse.urlparse

from urllib.parse import urlparse
from time import perf_counter

start_time = perf_counter()
total = 0
with open('/tmp/out.txt', 'rt') as f:
    for line in f:
        try:
            urlparse(line)
        except Exception:
            pass
        else:
            total +=1

end_time = perf_counter()
print(total, end_time - start_time, sep='\t')

from ada_url import URL
from time import perf_counter

start_time = perf_counter()
total = 0
with open('/tmp/out.txt', 'rt') as f:
    for line in f:
        try:
            with URL(line) as urlobj:
                pass
        except ValueError:
            pass
        else:
            total +=1

end_time = perf_counter()
print(total, end_time - start_time, sep='\t')
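
The two scripts above differ only in the parser call, so the loop could be factored into one helper. This is a sketch under assumptions: time_parser is a hypothetical name, it takes the lines already in memory (so file I/O is excluded from the timing, unlike the scripts above), and only the standard library parser is exercised here.

```python
from time import perf_counter
from urllib.parse import urlparse


def time_parser(parse, lines):
    """Parse every line; return (count parsed without error, elapsed seconds)."""
    total = 0
    start_time = perf_counter()
    for line in lines:
        try:
            parse(line)
        except Exception:
            # Count only the inputs the parser accepts, as the scripts above do.
            pass
        else:
            total += 1
    return total, perf_counter() - start_time


# Hypothetical in-memory sample; the scripts above read lines from a file instead.
sample = ["https://tkte.ch/search?q=canada\n", "https://github.com/ada-url/ada\n"]
total, elapsed = time_parser(urlparse, sample)
print(total, elapsed, sep='\t')
```

For ada_url, the callable passed in would wrap the with URL(line) block used above, so that construction and cleanup are both inside the timed region.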

@lemire
Member

lemire commented May 18, 2023

I ran the following...

import can_ada
from urllib.parse import urlparse
from ada_url import URL
from time import perf_counter
import urllib.request
import os
if not os.path.exists('top100.txt'):
    urllib.request.urlretrieve("https://github.com/ada-url/url-various-datasets/raw/main/top100/top100.txt", "top100.txt")



print("can_ada")

start_time = perf_counter()
total = 0
with open('top100.txt', 'rt') as f:
    for line in f:
        try:
            can_ada.parse(line)
        except Exception:
            pass
        else:
            total +=1

end_time = perf_counter()
print(total, end_time - start_time, sep='\t')

print("urllib")

start_time = perf_counter()
total = 0
with open('top100.txt', 'rt') as f:
    for line in f:
        try:
            urlparse(line)
        except Exception:
            pass
        else:
            total +=1

end_time = perf_counter()
print(total, end_time - start_time, sep='\t')

print("ada_url")

start_time = perf_counter()
total = 0
with open('top100.txt', 'rt') as f:
    for line in f:
        try:
            with URL(line) as urlobj:
                pass
        except ValueError:
            pass
        else:
            total +=1

end_time = perf_counter()
print(total, end_time - start_time, sep='\t')

I get...

can_ada
99999	0.06756645790301263
urllib
100031	0.23395362496376038
ada_url
99999	0.11724529205821455

So, basically, can_ada is about 3.5 times faster than urllib whereas ada_url is twice as fast as urllib.

To see how many millions of URLs per second this is, compute 0.1/x, where x is the elapsed time in seconds (the file holds about 100,000 URLs). So

  • can_ada parses 1.5 million URLs per second
  • the standard library is at 0.4 million URLs per second
  • ada_url is at 0.85 million URLs per second
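
The arithmetic behind those figures can be checked with a few lines. The counts and timings are copied from the output above; millions_per_second is a hypothetical helper name.

```python
# (URLs parsed, elapsed seconds) copied from the benchmark output above.
results = {
    "can_ada": (99999, 0.06756645790301263),
    "urllib": (100031, 0.23395362496376038),
    "ada_url": (99999, 0.11724529205821455),
}


def millions_per_second(count, seconds):
    """Throughput in millions of URLs per second."""
    return count / seconds / 1e6


urllib_seconds = results["urllib"][1]
for name, (count, seconds) in results.items():
    rate = millions_per_second(count, seconds)
    # Ratio of elapsed times, i.e. how many times faster than urllib.
    speedup = urllib_seconds / seconds
    print(f"{name}: {rate:.2f} million URLs/s, {speedup:.1f}x urllib")
```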

@bbayles
Collaborator Author

bbayles commented May 18, 2023

Here's a profiling view:
[image: profiling view, not preserved]

Almost 20% of the time for ada_url is spent making sure the C ada_url object is freed properly. I presume can_ada does that too?

@lemire
Member

lemire commented May 18, 2023

Node.js 20 can reach 3 million URLs per second on the same machine and the same dataset. So Node.js is twice as fast as can_ada, and about 3.5 times as fast as ada_url.

Of course, @anonrig was careful in the construction of the Node.js wrapper.

@anonrig
Member

anonrig commented May 18, 2023

I believe most of the performance difference is caused by UTF-8 encoding and decoding.

@bbayles
Collaborator Author

bbayles commented May 18, 2023

If I force ada_url.URL to accept bytes objects instead of str objects, I get 0.1807 vs. the 0.2290 from before.

Per the profiling, encode() is about 3% of the total time.

@TkTech

TkTech commented May 18, 2023

> Almost 20% of the time for ada_url is spent making sure the C ada_url object is freed properly. I presume can_ada does that too?

can_ada doesn't need to do this, since we're not wrapping the C API. We're putting a thin python object around the C++ object, and letting the C++ object lifecycle do its thing when the python object wrapping it gets garbage collected.

@bbayles
Collaborator Author

bbayles commented May 18, 2023

Makes sense. ada_url also calls the C function ada_is_valid after parsing. I think can_ada doesn't need to do this, because the C++ object will reflect the status (here)?

@TkTech

TkTech commented May 18, 2023

In that case, the status is held in the ada::result object, which we check and then discard to get the ada::url_aggregator which contains the actual data. So yes :)

@TkTech

TkTech commented May 18, 2023

> If I force ada_url.URL to accept bytes objects instead of str objects, I get 0.1807 vs. the 0.2290 from before.
>
> Per the profiling, encode() is about 3% of the total time.

This works for can_ada as well. As a str:

python3 -m timeit -s 'import can_ada' 'can_ada.parse("https://tkte.ch/search?q=canada")'
500000 loops, best of 5: 441 nsec per loop

or as a byte object:

python3 -m timeit -s 'import can_ada' 'can_ada.parse(b"https://tkte.ch/search?q=canada")'
1000000 loops, best of 5: 384 nsec per loop

Either is fine. We can actually take a lot of shortcuts here, since anything supporting the buffer protocol will work. For example you could pass in a raw memoryview() of the data read from a socket.
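
To illustrate what "anything supporting the buffer protocol" means on the Python side, here is a small sketch. It does not call can_ada; as_byte_view is a hypothetical helper that shows how bytes, bytearray, and memoryview inputs can all be viewed without copying, while a str has to be encoded first.

```python
def as_byte_view(data):
    """Return a zero-copy view of any bytes-like object; encode str first."""
    if isinstance(data, str):
        # str has no buffer interface, so encoding makes a new bytes object.
        data = data.encode("utf-8")
    return memoryview(data)


# A bytearray such as one filled by socket.recv_into() (assumed scenario).
raw = bytearray(b"https://tkte.ch/search?q=canada")

for candidate in ("https://tkte.ch/search?q=canada", bytes(raw), raw, memoryview(raw)):
    view = as_byte_view(candidate)
    print(type(candidate).__name__, view.nbytes, bytes(view[:8]))
```

Only the str input pays for an extra copy here, which would be consistent with the small gap between the two timings above.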

@bbayles
Collaborator Author

bbayles commented May 18, 2023

ada_url uses .encode() because CFFI needs it to interface with the (ref) char* input to ada_parse:

> Python 3 is supported, but the main point to note is that the char C type corresponds to the bytes Python type, and not str. It is your responsibility to encode/decode all Python strings to bytes when passing them to or receiving them from CFFI.

The encoding is a small part of the overall time, however, and so is probably not worth focusing on.


I may be able to avoid the __enter__ and __exit__ calls by taking advantage of CFFI's object lifecycle management:

> Unlike C, the returned pointer object has ownership on the allocated memory: when this exact object is garbage-collected, then the memory is freed. If, at the level of C, you store a pointer to the memory somewhere else, then make sure you also keep the object alive for as long as needed. (This also applies if you immediately cast the returned pointer to a pointer of a different type: only the original object has ownership, so you must keep it alive. As soon as you forget it, then the casted pointer will point to garbage! In other words, the ownership rules are attached to the wrapper cdata objects: they are not, and cannot, be attached to the underlying raw memory.)

I think as long as I have a reference to the urlobj attribute this will be satisfied.
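
The general idea of tying cleanup to the wrapper's lifetime, rather than to __enter__ and __exit__, can be sketched in pure Python with weakref.finalize. This is an analogy under assumptions, not the ada_url implementation: OwnedHandle and fake_free are hypothetical, and a real CFFI binding would more likely attach a destructor with ffi.gc().

```python
import weakref

freed = []  # records handles released by the stand-in "free" function


def fake_free(handle):
    # Stand-in for a C cleanup call; a real binding would call into C here.
    freed.append(handle)


class OwnedHandle:
    """Ties cleanup of a (pretend) C handle to the wrapper's lifetime."""

    def __init__(self, handle):
        self.handle = handle
        # The callback must not reference self, or the object could never die.
        self._finalizer = weakref.finalize(self, fake_free, handle)


obj = OwnedHandle(42)
del obj  # CPython reclaims the object right away via reference counting
print(freed)
```

With this pattern the caller no longer needs a with block: cleanup runs once the last reference to the wrapper goes away.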

@lemire lemire mentioned this pull request Mar 9, 2024