
support pre-parsed lxml.etree and faster json #37

Closed
codinguncut opened this issue Feb 23, 2017 · 4 comments

@codinguncut

Please add functions that accept a pre-parsed lxml.etree instead of an HTML string.
Also, using a library such as "ujson" may significantly speed up JSON-LD processing.

@redapple

redapple commented Feb 23, 2017

@codinguncut ,
Although it's not documented (nor explicitly tested), two of the extractors already support passing an lxml document directly (i.e. the result of an lxml parser's .fromstring(), which is how .extract() is implemented internally); see the sketch below.

The RDFa extractor is a bit different, since rdflib is tricked into thinking it is handling an xml.dom tree; but the lxml parser (extruct.rdfa.XmlDomHTMLParser) is available, and a method could be added to accept an xml.dom-compatible tree.
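For concreteness, a rough sketch of the string-based path described above (the class names are extruct's; exact signatures have varied between versions, so treat the calls as illustrative):

```python
import lxml.html

from extruct.jsonld import JsonLdExtractor
from extruct.w3cmicrodata import MicrodataExtractor

html = (
    '<html><body><script type="application/ld+json">'
    '{"@context": "http://schema.org", "@type": "Article"}'
    "</script></body></html>"
)

# Each .extract() call takes an HTML string and re-parses it internally
# (roughly lxml's fromstring) -- the duplicated work this issue asks to avoid.
jsonld_items = JsonLdExtractor().extract(html)
microdata_items = MicrodataExtractor().extract(html)

# Parsing once up front yields the lxml document the extractors already
# consume internally.
document = lxml.html.fromstring(html)
```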

Regarding speeding up JSON parsing, is ujson still the best option these days? (Honest question, I haven't used it in a long time.)

@codinguncut

Hi,
thank you for sharing this (undocumented) functionality.

I'm not sure if ujson is the "best" option (whatever that means), but it's significantly faster than vanilla json (especially on py2) and I've been using it as a robust drop-in replacement.

http://artem.krylysov.com/blog/2015/09/29/benchmark-python-json-libraries/
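For reference, the drop-in pattern described above is just a guarded import (a minimal sketch; whether extruct should adopt it is the open question in this thread):

```python
# Prefer ujson when it is installed, fall back to the standard library otherwise.
try:
    import ujson as json
except ImportError:
    import json

data = json.loads('{"@context": "http://schema.org", "@type": "Article"}')
assert data["@type"] == "Article"
```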

@redapple

redapple commented Mar 6, 2017

I meant "best" as "fastest", since that was one of your points.
PRs that try ujson when it is available, and that document the extractors' methods, are welcome.

@redapple

extract_items(document, url, *args, **kwargs) methods have been added to all extractors, taking an lxml-parsed document as input.
I've moved the ujson feature request to a separate issue: #49.
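A minimal usage sketch of the new methods, assuming the signature quoted above; the second argument (url vs. base_url) and the exact module paths may differ slightly between extruct versions:

```python
import lxml.html

from extruct.jsonld import JsonLdExtractor
from extruct.w3cmicrodata import MicrodataExtractor

html = """
<html><body>
  <script type="application/ld+json">
    {"@context": "http://schema.org", "@type": "Article", "headline": "Hello"}
  </script>
</body></html>
"""

# Parse once and share the tree across extractors instead of letting each
# .extract() call re-parse the HTML string.
document = lxml.html.fromstring(html)

jsonld_items = JsonLdExtractor().extract_items(document, "http://example.com/")
microdata_items = MicrodataExtractor().extract_items(document, "http://example.com/")
```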
