-
Notifications
You must be signed in to change notification settings - Fork 6
Import bbayles/what-url #1
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
@TkTech @Ezibenroc Could you have a look? This aims to wrap the Ada URL parsing library... https://github.com/ada-url/ada |
Co-authored-by: Yagiz Nizipli <[email protected]>
Co-authored-by: Yagiz Nizipli <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is not what I would consider to be a wrapper around Ada per se. I think it is best described as @bbayles does... a URL joining library. A wrapper would proceed as the Rust and Go does, and offer a WHATWG interface...
https://github.com/ada-url/goada
https://github.com/ada-url/rust
This is not what this Python code does.
It is fine for what it does, but I think we will need a wrapper and as such we should probably reserve the name ada_url
.
Maybe this python package should be called ada_url_join
?
And then we will want later to produce ada_url
which would follow the Go, Rust, JavaScript/Node model and offer setters/getters.
What about the idea of having a low level module that exposes the ADA C functions more or less as-is, and a higher level module that has functions like the ones I have already? I slightly object to the idea that this is just for URL joining as-is, because the |
I went ahead and added a |
I like the new URL class a lot. Coud @anonrig I think we need to rename the repository. It could be anything, but |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How do we publish to pip? Who ever does it, can you make sure @lemire and I have admin access.
|
||
steps: | ||
- uses: actions/checkout@v3 | ||
- name: Set up Python ${{ matrix.python }} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we install Ruff and test the linter as well in a different workflow?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I can split the workflow into different pieces, sure.
readme_dst = os.path.join(build_dir, 'README.pprst') | ||
shutil.copyfile(readme_src, readme_dst) | ||
|
||
project = 'ada-url/python' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Lets not forget to change this when we rename the repo
Re: publishing to PyPI, I am happy to do the deed and make people here admins for the repository as well. I currently maintain several packages there. Please let me know your |
I think it is good if @bbayles does it. |
I think we need help and having someone who can handle the updates is great. @bbayles My profile is here: https://pypi.org/user/lemire/ |
I will publish the PyPI package after we do the repository rename. I will also publish docs to https://readthedocs.org/ - I'll make the same people admin for the project. Could I be added as a contributor for this repo as well? Thanks for accepting this PR - I'm happy to have a fast URL library for Python. |
I renamed it and invited you as a collaborator @bbayles. Thank you for your contribution. |
@bbayles I've created an account on PyPi. My username is anonrig. |
I had taken the liberty of inviting @bbayles to the org yesterday. |
@lemire I took a quick stab over breakfast and tossed up my attempt here: https://github.com/TkTech/can_ada. It's a very thin layer over the library and should require minimal changes with library updates. It releases binary wheels for x86, ARM, PowerPC, IBM Z, etc... I believe this approach is simpler. Feel free to yoink anything to use in this repo if you find it useful. |
It looks like @TkTech 's wrapper is faster... Can you guys check this out?
I am also finding that it does a better job at exposing the attributes...
|
Ranked by performance... I have
|
Keep in mind that the majority of the time on micro-benchmarks like this is spent in the pybind11 magic wrapper glue. It can be made quite a bit faster (often by an order of magnitude, but it varies) by ditching pybind11 and doing all the wrapper work yourself using the plain Cython API. However, that will make the wrapper 1k+ lines, rather than under 100. For a quick proof of concept I did not want to spend that effort. A pybind11-based binding is a good balance of performance and ease of development. |
I will fix the To see what's fast/slow we can look at profiling. |
Also note that
|
Maybe there is a confusion. Try |
I suspect that the C++ time is probably in the 150 ns to 200 ns range. So, probably, half the time is spent on magic wrapper glue. Beating |
I think that that |
Yes, this is just me being cheeky, and naming projects is hard :) I actually live about a 30 minute walk from Lemire. |
For what it's worth, on this dataset I get:
from urllib.parse import urlparse
from time import perf_counter
start_time = perf_counter()
total = 0
with open('/tmp/out.txt', 'rt') as f:
for line in f:
try:
urlparse(line)
except Exception:
pass
else:
total +=1
end_time = perf_counter()
print(total, end_time - start_time, sep='\t') from ada_url import URL
from time import perf_counter
start_time = perf_counter()
total = 0
with open('/tmp/out.txt', 'rt') as f:
for line in f:
try:
with URL(line) as urlobj:
pass
except ValueError:
pass
else:
total +=1
end_time = perf_counter()
print(total, end_time - start_time, sep='\t') |
I ran the following... import can_ada
from urllib.parse import urlparse
from ada_url import URL
from time import perf_counter
import urllib.request
import os
if not os.path.exists('top100.txt'):
urllib.request.urlretrieve("https://github.com/ada-url/url-various-datasets/raw/main/top100/top100.txt", "top100.txt")
print("can_ada")
start_time = perf_counter()
total = 0
with open('top100.txt', 'rt') as f:
for line in f:
try:
can_ada.parse(line)
except Exception:
pass
else:
total +=1
end_time = perf_counter()
print(total, end_time - start_time, sep='\t')
print("urllib")
start_time = perf_counter()
total = 0
with open('top100.txt', 'rt') as f:
for line in f:
try:
urlparse(line)
except Exception:
pass
else:
total +=1
end_time = perf_counter()
print(total, end_time - start_time, sep='\t')
print("ada_url")
start_time = perf_counter()
total = 0
with open('top100.txt', 'rt') as f:
for line in f:
try:
with URL(line) as urlobj:
pass
except ValueError:
pass
else:
total +=1
end_time = perf_counter()
print(total, end_time - start_time, sep='\t') I get...
So, basically, To see how many millions of URLs per second this is, do...
|
Node.js 20 can reach 3 million URLs per second on the same machine and the same dataset. So Node.js is twice as fast can_ada, and about four times as far as ada_url. Of course, @anonrig was careful in the construction of the Node.js wrapper. |
i believe most of the performance diff is caused by utf8 encoding and decoding. |
If I force Per the profiling, |
|
Makes sense. |
In that case, the status is held in the |
This works for
or as a
Either is fine. We can actually take a lot of shortcuts here, since anything supporting the buffer protocol will work. For example you could pass in a raw |
The encoding is a small part of the overall time, however, and so is probably not worth focusing on. I may be able to avoid the
I think as long as I have a reference to the |
Re: ada-url/ada#408, this PR brings in the code from bbayles/what-url.
We can discuss what needs to be added, subtracted, or changed. Here was my to-do list:
delocate
)what_url
toada_url
I came up with the list of exposed functions based on what I thought would be useful/familiar to Python users, but we can add/subtract/change as needed.