
support pre-parsed lxml.etree and faster json #37

Closed
codinguncut opened this issue Feb 23, 2017 · 4 comments

@codinguncut

Please add functions that accept a pre-parsed lxml.etree instead of an HTML string.
Also, using a library such as "ujson" may significantly speed up JSON-LD processing.

@redapple

redapple commented Feb 23, 2017

@codinguncut ,
Although it's not documented (nor explicitly tested), two of the extractors already support passing an lxml document directly (i.e. the result of an lxml parser's .fromstring(), which is how .extract() is implemented internally); see the sketch below.

The RDFa extractor is a bit different, since rdflib is tricked into thinking it is handling an xml.dom tree; but the lxml parser (extruct.rdfa.XmlDomHTMLParser) is available, and a method could be added to accept an xml.dom-compatible tree.
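For concreteness, a rough sketch of the string-based path described above (the class names are extruct's; exact signatures have varied between versions, so treat the calls as illustrative):

```python
import lxml.html

from extruct.jsonld import JsonLdExtractor
from extruct.w3cmicrodata import MicrodataExtractor

html = (
    '<html><body><script type="application/ld+json">'
    '{"@context": "http://schema.org", "@type": "Article"}'
    "</script></body></html>"
)

# Each .extract() call takes an HTML string and re-parses it internally
# (roughly lxml's fromstring) -- the duplicated work this issue asks to avoid.
jsonld_items = JsonLdExtractor().extract(html)
microdata_items = MicrodataExtractor().extract(html)

# Parsing once up front yields the lxml document the extractors already
# consume internally.
document = lxml.html.fromstring(html)
```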

Regarding speeding up JSON parsing, is ujson still the best option these days? (Honest question, I haven't used it in a long time.)

@codinguncut

Hi,
thank you for sharing this (undocumented) functionality.

I'm not sure if ujson is the "best" option (whatever that means), but it's significantly faster than vanilla json (especially on py2) and I've been using it as a robust drop-in replacement.

http://artem.krylysov.com/blog/2015/09/29/benchmark-python-json-libraries/
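For reference, the drop-in pattern described above is just a guarded import (a minimal sketch; whether extruct should adopt it is the open question in this thread):

```python
# Prefer ujson when it is installed, fall back to the standard library otherwise.
try:
    import ujson as json
except ImportError:
    import json

data = json.loads('{"@context": "http://schema.org", "@type": "Article"}')
assert data["@type"] == "Article"
```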

@redapple

redapple commented Mar 6, 2017

I meant "best" as "fastest", since that was one of your points.
PRs that try ujson when it is available, and that document the extractors' methods, are welcome.

@redapple

extract_items(document, url, *args, **kwargs) methods have been added to all extractors, taking an lxml-parsed document as input.
I've moved the ujson feature request to a separate issue: #49.
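A minimal usage sketch of the new methods, assuming the signature quoted above; the second argument (url vs. base_url) and the exact module paths may differ slightly between extruct versions:

```python
import lxml.html

from extruct.jsonld import JsonLdExtractor
from extruct.w3cmicrodata import MicrodataExtractor

html = """
<html><body>
  <script type="application/ld+json">
    {"@context": "http://schema.org", "@type": "Article", "headline": "Hello"}
  </script>
</body></html>
"""

# Parse once and share the tree across extractors instead of letting each
# .extract() call re-parse the HTML string.
document = lxml.html.fromstring(html)

jsonld_items = JsonLdExtractor().extract_items(document, "http://example.com/")
microdata_items = MicrodataExtractor().extract_items(document, "http://example.com/")
```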
