Import bbayles/what-url #1

bbayles · 2023-05-16T21:28:58Z

Re: ada-url/ada#408, this PR brings in the code from bbayles/what-url.

We can discuss what needs to be added, subtracted, or changed. Here was my to-do list:

Build Python wheels for Windows
Make the Mac OS wheels portable (with delocate)
Make the Linux wheels more compatible (i.e. build them on a machine with an older toolchain)
Add benchmarks
Rename the package from what_url to ada_url

I came up with the list of exposed functions based on what I thought would be useful/familiar to Python users, but we can add/subtract/change as needed.

.github/workflows/build_test.yml

Makefile

pyproject.toml

setup.cfg

what_url/ada_adapter.py

lemire · 2023-05-16T21:56:22Z

@TkTech @Ezibenroc Could you have a look? This aims to wrap the Ada URL parsing library... https://github.com/ada-url/ada

Co-authored-by: Yagiz Nizipli <[email protected]>

Makefile

README.rst

lemire

This is not what I would consider to be a wrapper around Ada per se. I think it is best described as @bbayles does... a URL joining library. A wrapper would proceed as the Rust and Go wrappers do, and offer a WHATWG interface...

https://github.com/ada-url/goada

https://github.com/ada-url/rust

This is not what this Python code does.

It is fine for what it does, but I think we will need a wrapper and as such we should probably reserve the name ada_url.

Maybe this python package should be called ada_url_join?

And then, later, we will want to produce ada_url, which would follow the Go, Rust, JavaScript/Node model and offer setters/getters.

bbayles · 2023-05-17T18:48:13Z

What about the idea of having a low-level module that exposes the Ada C functions more or less as-is, and a higher-level module that has functions like the ones I have already?

I slightly object to the idea that this is just for URL joining as-is, because the parse_url and replace_url functions are there to extract and modify URL components.

bbayles · 2023-05-17T20:30:12Z

I went ahead and added a URL class that should more faithfully match the one described by the WHATWG spec. Check it out!

lemire · 2023-05-18T00:50:43Z

I like the new URL class a lot. Could __str__() return something useful? Can __dir__() return the attributes? Regarding the latter, my editor seems to use __dir__() for autocompletion so it is somewhat important. With the new instructions, I get clean builds and everything, which is fantastic.
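A minimal sketch of what this request amounts to, assuming a wrapper that resolves its URL fields dynamically through __getattr__. The class SketchURL and its _FIELDS tuple are hypothetical stand-ins for illustration; this is not the actual ada_url.URL implementation.

```python
class SketchURL:
    """Hypothetical stand-in for a URL wrapper; not the real ada_url.URL."""

    # A few attribute names from the WHATWG URL interface (illustrative subset).
    _FIELDS = ('href', 'protocol', 'hostname', 'pathname', 'search')

    def __init__(self, **parts):
        self._parts = {name: parts.get(name, '') for name in self._FIELDS}

    def __getattr__(self, name):
        # Only called when normal lookup fails, i.e. for the dynamic fields.
        try:
            return self.__dict__['_parts'][name]
        except KeyError:
            raise AttributeError(name) from None

    def __dir__(self):
        # Advertise the dynamic fields so dir() and editors can list them.
        return sorted(set(super().__dir__()) | set(self._FIELDS))

    def __str__(self):
        # Return the serialized URL rather than the default object repr.
        return self._parts['href']


url = SketchURL(
    href='https://example.org/a?b=1',
    protocol='https:',
    hostname='example.org',
    pathname='/a',
    search='?b=1',
)
print(str(url))
print([name for name in dir(url) if not name.startswith('_')])
```

Without the __dir__ override, attributes served by __getattr__ stay invisible to dir(), which is why autocompletion would miss them.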

@anonrig I think we need to rename the repository. It could be anything, but python is an unfortunate choice. I think GitHub will let us rename (at least once).

anonrig

How do we publish to pip? Whoever does it, can you make sure @lemire and I have admin access?

anonrig · 2023-05-18T00:53:51Z

.github/workflows/build_test.yml

+
+    steps:
+    - uses: actions/checkout@v3
+    - name: Set up Python ${{ matrix.python }}


Can we install Ruff and test the linter as well in a different workflow?

I can split the workflow into different pieces, sure.

ada_url/ada_adapter.py

anonrig · 2023-05-18T00:57:36Z

docs/conf.py

+readme_dst = os.path.join(build_dir, 'README.pprst')
+shutil.copyfile(readme_src, readme_dst)
+
+project = 'ada-url/python'


Let's not forget to change this when we rename the repo.

bbayles · 2023-05-18T01:11:16Z

Re: publishing to PyPI, I am happy to do the deed and make people here admins for the package as well. I currently maintain several packages there. Please let me know your PyPI usernames if you'd like me to do this.

lemire · 2023-05-18T01:48:41Z

I think it is good if @bbayles does it.

lemire · 2023-05-18T01:52:05Z

I think we need help and having someone who can handle the updates is great.

@bbayles My profile is here: https://pypi.org/user/lemire/

lemire · 2023-05-18T01:53:08Z

@anonrig Do you want to change the name of the repo? Either now or after merging this?

Note that I think that this PR can be merged now. @anonrig: do you want to do it when you think it is OK?

bbayles · 2023-05-18T02:13:57Z

I will publish the PyPI package after we do the repository rename.

I will also publish docs to https://readthedocs.org/ - I'll make the same people admin for the project.

Could I be added as a contributor for this repo as well?

Thanks for accepting this PR - I'm happy to have a fast URL library for Python.

anonrig · 2023-05-18T03:21:31Z

I renamed it and invited you as a collaborator @bbayles. Thank you for your contribution.

anonrig · 2023-05-18T03:24:35Z

@bbayles I've created an account on PyPI. My username is anonrig.

lemire · 2023-05-18T13:42:10Z

I had taken the liberty of inviting @bbayles to the org yesterday.

TkTech · 2023-05-18T15:35:53Z

@TkTech @Ezibenroc Could you have a look? This aims to wrap the Ada URL parsing library... https://github.com/ada-url/ada

@lemire I took a quick stab over breakfast and tossed up my attempt here: https://github.com/TkTech/can_ada. It's a very thin layer over the library and should require minimal changes with library updates. It releases binary wheels for x86, ARM, PowerPC, IBM Z, etc...

I believe this approach is simpler. Feel free to yoink anything to use in this repo if you find it useful.

lemire · 2023-05-18T15:53:56Z

@TkTech @anonrig @bbayles

It looks like @TkTech 's wrapper is faster... Can you guys check this out?

$ pip3 install can_ada
$ pip3 install ada_url
$ python3 -m timeit -s 'import can_ada' 'can_ada.parse("https://tkte.ch/search?q=canada")'
1000000 loops, best of 5: 386 nsec per loop
$ python3 -m timeit -s 'import ada_url' 'ada_url.URL("https://tkte.ch/search?q=canada")'
500000 loops, best of 5: 558 nsec per loop

I am also finding that it does a better job at exposing the attributes...

>>> import can_ada
>>> dir(can_ada.parse("https://tkte.ch/search?q=canada"))
['__class__', '__delattr__', '__dir__', '__doc__', '__eq__', '__format__', '__ge__', '__getattribute__', '__getstate__', '__gt__', '__hash__', '__init__', '__init_subclass__', '__le__', '__lt__', '__module__', '__ne__', '__new__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__sizeof__', '__str__', '__subclasshook__', 'has_credentials', 'has_empty_hostname', 'has_hash', 'has_hostname', 'has_non_empty_password', 'has_non_empty_username', 'has_password', 'has_port', 'has_search', 'has_valid_domain', 'hash', 'host', 'hostname', 'href', 'origin', 'password', 'pathname', 'pathname_length', 'port', 'protocol', 'search', 'to_diagram', 'username', 'validate']
>>> import ada_url
>>> dir(ada_url.URL("https://tkte.ch/search?q=canada"))
['__class__', '__delattr__', '__dict__', '__dir__', '__doc__', '__enter__', '__eq__', '__exit__', '__format__', '__ge__', '__getattr__', '__getattribute__', '__getstate__', '__gt__', '__hash__', '__init__', '__init_subclass__', '__le__', '__lt__', '__module__', '__ne__', '__new__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__sizeof__', '__str__', '__subclasshook__', '__weakref__', 'can_parse', 'close', 'urlobj']

lemire · 2023-05-18T15:59:56Z

Ranked by performance... I have can_ada, the standard library (urllib) and then ada_url. I am finding that ada_url is 50% slower than can_ada.

$ python3 -m timeit -s 'import can_ada' 'can_ada.parse("https://tkte.ch/search?q=canada")'
1000000 loops, best of 5: 378 nsec per loop

$ python3 -m timeit -s 'import urllib' 'urllib.parse.urlparse("https://tkte.ch/search?q=canada")'
500000 loops, best of 5: 491 nsec per loop

$ python3 -m timeit -s 'import ada_url' 'ada_url.URL("https://tkte.ch/search?q=canada")'
500000 loops, best of 5: 575 nsec per loop
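The same per-call measurement can be scripted with the timeit module, which is handy for comparing several parsers in one run. This is a sketch under assumptions: the helper name nsec_per_call is made up for illustration, the lambda adds a small constant overhead to every candidate, and can_ada and ada_url are only measured if they happen to be installed. Note also that urllib caches recently parsed URLs, so a single-URL micro-benchmark likely flatters it.

```python
import timeit
from urllib.parse import urlparse

URL_TEXT = "https://tkte.ch/search?q=canada"


def nsec_per_call(func, arg, number=100_000, repeat=5):
    """Best-of-`repeat` time for one call of func(arg), in nanoseconds."""
    timer = timeit.Timer(lambda: func(arg))
    best = min(timer.repeat(repeat=repeat, number=number))
    return best / number * 1e9


# The standard library is always available; the other two are optional.
candidates = {"urllib": urlparse}
try:
    import can_ada
    candidates["can_ada"] = can_ada.parse
except ImportError:
    pass
try:
    import ada_url
    candidates["ada_url"] = ada_url.URL
except ImportError:
    pass

for name, func in candidates.items():
    print(f"{name}: {nsec_per_call(func, URL_TEXT):.0f} nsec per call")
```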

TkTech · 2023-05-18T16:17:54Z

Keep in mind that the majority of the time on micro-benchmarks like this is spent in the pybind11 magic wrapper glue. It can be made quite a bit faster (often by an order of magnitude, but it varies) by ditching pybind11 and doing all the wrapper work yourself using the plain CPython API.

However, that will make the wrapper 1k+ lines, rather than under 100. For a quick proof of concept I did not want to spend that effort. A pybind11-based binding is a good balance of performance and ease of development.

bbayles · 2023-05-18T16:22:18Z

I will fix the __dir__ listing shortly.

To see what's fast/slow we can look at profiling.
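For readers who want to try this, here is a minimal profiling sketch using cProfile from the standard library. It profiles urllib only so that it runs anywhere; swapping in another parser is left as an assumption. The helper parse_many and the generated example.org URLs are illustrative, not part of either package.

```python
import cProfile
import io
import pstats
from urllib.parse import urlparse


def parse_many(urls):
    """Parse every URL once; the profiler attributes time to the callees."""
    for url in urls:
        urlparse(url)


# Distinct URLs, so urllib's internal cache does not hide the parsing work.
urls = [f"https://example.org/path/{i}?q={i}" for i in range(20_000)]

profiler = cProfile.Profile()
profiler.enable()
parse_many(urls)
profiler.disable()

# Print the ten entries with the highest cumulative time.
report = io.StringIO()
stats = pstats.Stats(profiler, stream=report)
stats.sort_stats("cumulative").print_stats(10)
print(report.getvalue())
```

cProfile only sees Python-level calls, so time spent inside a C or C++ extension shows up as a single opaque entry.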

bbayles · 2023-05-18T16:23:47Z

Also note that ada_url.URL is a more involved constructor than can_parse. A closer test would be:

import ada_url
ada_url.URL.can_parse("https://tkte.ch/search?q=canada")

lemire · 2023-05-18T16:26:20Z

A closer test would be:

Maybe there is a confusion. Try can_ada.parse("https://tkte.ch/search?q=canada"). It returns what is effectively a URL instance. The name can_ada has probably to do with the fact that @TkTech lives in Canada. It is not related to can_parse.

lemire · 2023-05-18T16:28:07Z

Keep in mind that the majority of the time on micro-benchmarks like this is spent in the pybind11 magic wrapper glue

I suspect that the C++ time is probably in the 150 ns to 200 ns range. So, probably, half the time is spent on magic wrapper glue. Beating urllib is great though.

lemire · 2023-05-18T16:29:07Z

I think that can_ada.parse and ada_url.URL are functionally comparable. Both construct and expose a URL instance.

TkTech · 2023-05-18T16:30:41Z

The name can_ada has probably to do with the fact that @TkTech lives in Canada.

Yes, this is just me being cheeky, and naming projects is hard :) I actually live about a 30 minute walk from Lemire.

bbayles · 2023-05-18T16:42:42Z

For what it's worth, on this dataset I get:

0.2290 seconds to get through with ada_url.URL
0.4666 seconds to get through with urllib.parse.urlparse

from urllib.parse import urlparse
from time import perf_counter

start_time = perf_counter()
total = 0
with open('/tmp/out.txt', 'rt') as f:
    for line in f:
        try:
            urlparse(line)
        except Exception:
            pass
        else:
            total += 1

end_time = perf_counter()
print(total, end_time - start_time, sep='\t')

from ada_url import URL
from time import perf_counter

start_time = perf_counter()
total = 0
with open('/tmp/out.txt', 'rt') as f:
    for line in f:
        try:
            with URL(line) as urlobj:
                pass
        except ValueError:
            pass
        else:
            total += 1

end_time = perf_counter()
print(total, end_time - start_time, sep='\t')
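The two loops above differ only in the parser they call, so they can be folded into one helper. A sketch, assuming the lines are already in memory (which also keeps file I/O out of the timed region); the name count_parsed and the two sample URLs are illustrative.

```python
from time import perf_counter
from urllib.parse import urlparse


def count_parsed(parse, lines):
    """Return (number of lines parsed without error, elapsed seconds)."""
    total = 0
    start = perf_counter()
    for line in lines:
        try:
            parse(line)
        except Exception:
            continue
        total += 1
    return total, perf_counter() - start


# Tiny in-memory sample; a real run would read the dataset file instead.
sample = ["https://example.org/\n", "https://tkte.ch/search?q=canada\n"]
total, elapsed = count_parsed(urlparse, sample)
print(total, elapsed, sep="\t")
```

Any callable that raises on invalid input can be passed as parse, for example ada_url.URL or can_ada.parse if those packages are installed.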

lemire · 2023-05-18T16:57:52Z

I ran the following...

import can_ada
from urllib.parse import urlparse
from ada_url import URL
from time import perf_counter
import urllib.request
import os
if not os.path.exists('top100.txt'):
    urllib.request.urlretrieve("https://github.com/ada-url/url-various-datasets/raw/main/top100/top100.txt", "top100.txt")



print("can_ada")

start_time = perf_counter()
total = 0
with open('top100.txt', 'rt') as f:
    for line in f:
        try:
            can_ada.parse(line)
        except Exception:
            pass
        else:
            total += 1

end_time = perf_counter()
print(total, end_time - start_time, sep='\t')

print("urllib")

start_time = perf_counter()
total = 0
with open('top100.txt', 'rt') as f:
    for line in f:
        try:
            urlparse(line)
        except Exception:
            pass
        else:
            total += 1

end_time = perf_counter()
print(total, end_time - start_time, sep='\t')

print("ada_url")

start_time = perf_counter()
total = 0
with open('top100.txt', 'rt') as f:
    for line in f:
        try:
            with URL(line) as urlobj:
                pass
        except ValueError:
            pass
        else:
            total += 1

end_time = perf_counter()
print(total, end_time - start_time, sep='\t')

I get...

can_ada
99999	0.06756645790301263
urllib
100031	0.23395362496376038
ada_url
99999	0.11724529205821455

So, basically, can_ada is about 3.5 times faster than urllib whereas ada_url is twice as fast as urllib.

To see how many millions of URLs per second this is, do... 0.1/x. So

can_ada parses 1.5 million URLs per second
the standard library is at 0.4 million URLs per second
ada_url is at 0.85 million URLs per second
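The arithmetic behind these figures can be reproduced from the counts and timings printed above. A small sketch; the function name millions_per_second is illustrative, and the speedups are taken as plain ratios of elapsed time, which ignores the slightly different line counts.

```python
# (lines counted, elapsed seconds) as reported in the run above.
results = {
    "can_ada": (99999, 0.06756645790301263),
    "urllib": (100031, 0.23395362496376038),
    "ada_url": (99999, 0.11724529205821455),
}


def millions_per_second(count, seconds):
    """Throughput in millions of URLs per second."""
    return count / seconds / 1e6


throughput = {name: millions_per_second(*r) for name, r in results.items()}
for name, value in throughput.items():
    print(f"{name}: {value:.2f} million URLs per second")

# Speedup over urllib, as a ratio of elapsed times.
speedup_can_ada = results["urllib"][1] / results["can_ada"][1]
speedup_ada_url = results["urllib"][1] / results["ada_url"][1]
print(f"can_ada vs urllib: {speedup_can_ada:.2f}x")
print(f"ada_url vs urllib: {speedup_ada_url:.2f}x")
```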

bbayles · 2023-05-18T17:00:15Z

Here's a profiling view:

Almost 20% of the time for ada_url is spent making sure the C ada_url object is freed properly. I presume can_ada does that too?

lemire · 2023-05-18T17:00:23Z

Node.js 20 can reach 3 million URLs per second on the same machine and the same dataset. So Node.js is twice as fast as can_ada, and about 3.5 times as fast as ada_url.

Of course, @anonrig was careful in the construction of the Node.js wrapper.

anonrig · 2023-05-18T17:01:55Z

I believe most of the performance difference is caused by UTF-8 encoding and decoding.

bbayles · 2023-05-18T17:05:31Z

If I force ada_url.URL to accept bytes objects instead of str objects, I get 0.1807 vs. the 0.2290 from before.

Per the profiling, encode() is about 3% of the total time.

TkTech · 2023-05-18T17:31:13Z

Almost 20% of the time for ada_url is spent making sure the C ada_url object is freed properly. I presume can_ada does that too?

can_ada doesn't need to do this, since we're not wrapping the C API. We're putting a thin python object around the C++ object, and letting the C++ object lifecycle do its thing when the python object wrapping it gets garbage collected.

bbayles · 2023-05-18T17:33:32Z

Makes sense. ada_url also calls the C function ada_is_valid after parsing. I think can_ada doesn't need to do this, because the C++ object will reflect the status?

TkTech · 2023-05-18T17:47:34Z

In that case, the status is held in the ada::result object, which we check and then discard to get the ada::url_aggregator which contains the actual data. So yes :)

TkTech · 2023-05-18T17:49:07Z

If I force ada_url.URL to accept bytes objects instead of str objects, I get 0.1807 vs. the 0.2290 from before.

Per the profiling, encode() is about 3% of the total time.

This works for can_ada as well. As an encoded str:

python3 -m timeit -s 'import can_ada' 'can_ada.parse("https://tkte.ch/search?q=canada")'
500000 loops, best of 5: 441 nsec per loop

or as a byte object:

python3 -m timeit -s 'import can_ada' 'can_ada.parse(b"https://tkte.ch/search?q=canada")'
1000000 loops, best of 5: 384 nsec per loop

Either is fine. We can actually take a lot of shortcuts here, since anything supporting the buffer protocol will work. For example you could pass in a raw memoryview() of the data read from a socket.
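To illustrate what "anything supporting the buffer protocol" means on the Python side, here is a sketch of a helper that accepts str, bytes, bytearray, or memoryview. The name as_url_bytes is hypothetical and this is not how can_ada is implemented; a native binding would typically read the buffer in place instead of copying it as tobytes() does here.

```python
def as_url_bytes(data):
    """Hypothetical helper: normalise str or buffer-like input to bytes."""
    if isinstance(data, str):
        return data.encode("utf-8")
    # memoryview() raises TypeError for objects without buffer support.
    return memoryview(data).tobytes()


raw = bytearray(b"https://tkte.ch/search?q=canada")
inputs = [
    "https://tkte.ch/search?q=canada",  # str, encoded to UTF-8
    bytes(raw),                         # immutable bytes
    raw,                                # mutable bytearray
    memoryview(raw),                    # view over existing memory
]
normalised = [as_url_bytes(item) for item in inputs]
print(all(item == normalised[0] for item in normalised))
```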

bbayles · 2023-05-18T19:23:29Z

ada_url uses .encode() because CFFI needs it to interface with the char* input to ada_parse:

Python 3 is supported, but the main point to note is that the char C type corresponds to the bytes Python type, and not str. It is your responsibility to encode/decode all Python strings to bytes when passing them to or receiving them from CFFI.

The encoding is a small part of the overall time, however, and so is probably not worth focusing on.

I may be able to avoid the __enter__ and __exit__ calls by taking advantage of CFFI's object lifecycle management:

Unlike C, the returned pointer object has ownership on the allocated memory: when this exact object is garbage-collected, then the memory is freed. If, at the level of C, you store a pointer to the memory somewhere else, then make sure you also keep the object alive for as long as needed. (This also applies if you immediately cast the returned pointer to a pointer of a different type: only the original object has ownership, so you must keep it alive. As soon as you forget it, then the casted pointer will point to garbage! In other words, the ownership rules are attached to the wrapper cdata objects: they are not, and cannot, be attached to the underlying raw memory.)

I think as long as I have a reference to the urlobj attribute this will be satisfied.
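The general idea of tying a native resource to the lifetime of its Python wrapper, instead of requiring __enter__ and __exit__, can be sketched with the standard library alone. This is an analogy using weakref.finalize, not CFFI's own mechanism; the Handle class, the fake_free function, and the integer handle are all hypothetical stand-ins for a C object and its free function.

```python
import gc
import weakref

released = []


def fake_free(handle):
    """Stand-in for a C-level free function; records what was released."""
    released.append(handle)


class Handle:
    """Hypothetical wrapper that owns a fake native handle."""

    def __init__(self, handle):
        self.handle = handle
        # Runs fake_free(handle) once this object is garbage-collected.
        # The callback must not reference self, or the object never dies.
        self._finalizer = weakref.finalize(self, fake_free, handle)


obj = Handle(42)
print(released)  # nothing released while obj is alive
del obj
gc.collect()     # not needed on CPython, but harmless elsewhere
print(released)
```

With this pattern the caller no longer needs a with block, at the cost of the release happening whenever the interpreter collects the wrapper rather than at a fixed point in the code.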

Import bbayles/what-url

b120c62

anonrig requested a review from lemire May 16, 2023 21:31

anonrig reviewed May 16, 2023


bbayles and others added 4 commits May 16, 2023 20:04

Add concurrency to GHA

8076e7e

Co-authored-by: Yagiz Nizipli <[email protected]>

Update setup.cfg

4c84f39

Co-authored-by: Yagiz Nizipli <[email protected]>

Rename from what-url to ada-url

37da13d

Update GHA conditions

2e8072e

WojciechMula reviewed May 17, 2023

Makefile (outdated, resolved)

WojciechMula reviewed May 17, 2023

Makefile (outdated, resolved)

bbayles added 3 commits May 17, 2023 11:20

Add script to update the Ada single header package

a806b87

Use for compiling

096e133

Makefile variable for rm

c9a2385

lemire reviewed May 17, 2023

README.rst (resolved)

lemire reviewed May 17, 2023

README.rst (resolved)

lemire requested changes May 17, 2023


Add WHATWG URL class

07c9a94

bbayles added 2 commits May 17, 2023 15:52

Add buil instructions

232afc3

Example for replace_url

7c8d507

lemire approved these changes May 18, 2023


anonrig approved these changes May 18, 2023


Fix typo in error message

fffac1b

anonrig merged commit 32c92cc into ada-url:main May 18, 2023

bbayles deleted the bbayles-what-url branch May 18, 2023 02:12

bbayles mentioned this pull request May 18, 2023

Use ffi.gc to handle object lifecycle #5

Merged

lemire mentioned this pull request Mar 9, 2024

Add benchmarks #59

Closed
