Commit c6e8340

prepare version 2.0.0 (#759)
* prepare version 2.0.0
* update setup and wording
* docs: readme and structure
* update dependabot and funding
* update contributing and history files
1 parent b7bfcc3 commit c6e8340

9 files changed, +85 -111 lines changed

.github/FUNDING.yml

Lines changed: 1 addition & 1 deletion
@@ -1,6 +1,6 @@
11
# These are supported funding model platforms
22

3-
github: # Replace with up to 4 GitHub Sponsors-enabled usernames e.g., [user1, user2]
3+
github: [adbar]
44
patreon: # Replace with a single Patreon username
55
open_collective: # Replace with a single Open Collective username
66
ko_fi: adbarbaresi

.github/dependabot.yml

Lines changed: 0 additions & 18 deletions
This file was deleted.

CONTRIBUTING.md

Lines changed: 14 additions & 12 deletions
@@ -1,39 +1,41 @@
11
## How to contribute
22

3-
Thank you for considering contributing to Trafilatura! Your contributions make the software and its documentation better.
3+
Your contributions make the software and its documentation better. A special thanks to all the [contributors](https://github.com/adbar/trafilatura/graphs/contributors) who have played a part in Trafilatura.
44

55

66
There are many ways to contribute, you could:
77

88
* Improve the documentation: Write tutorials and guides, correct mistakes, or translate existing content.
9-
* Find bugs and submit bug reports: Help making Trafilatura a robust and versatile tool.
9+
* Find bugs and submit bug reports: Help make Trafilatura an even more robust tool.
1010
* Submit feature requests: Share your feedback and suggestions.
1111
* Write code: Fix bugs or add new features.
1212

1313

1414
Here are some important resources:
1515

1616
* [List of currently open issues](https://github.com/adbar/trafilatura/issues) (no pretention to exhaustivity!)
17-
* [Roadmap and milestones](https://github.com/adbar/trafilatura/milestones)
18-
* [How to Contribute to Open Source](https://opensource.guide/how-to-contribute/)
17+
* [How to contribute to open source](https://opensource.guide/how-to-contribute/)
1918

2019

21-
## Submitting changes
20+
## Testing and evaluating the code
2221

23-
Please send a [GitHub Pull Request to trafilatura](https://github.com/adbar/trafilatura/pull/new/master) with a clear list of what you have done (read more about [pull requests](http://help.github.com/pull-requests/)).
22+
Here is how you can run the tests and code quality checks:
2423

25-
**Working on your first Pull Request?** See this tutorial: [How To Create a Pull Request on GitHub](https://www.digitalocean.com/community/tutorials/how-to-create-a-pull-request-on-github)
24+
- Install the necessary packages with `pip install trafilatura[dev]`
25+
- Run `pytest` from trafilatura's directory, or select a particular test suite, for example `realworld_tests.py`, and run `pytest realworld_tests.py` or simply `python3 realworld_tests.py`
26+
- Run `mypy` on the directory: `mypy trafilatura/`
27+
- See also the [tests Readme](tests/README.rst) for information on the evaluation benchmark
2628

29+
Pull requests will only be accepted if there are no errors in pytest and mypy.
2730

28-
A special thanks to all the [contributors](https://github.com/adbar/trafilatura/graphs/contributors) who have played a part in Trafilatura.
31+
If you work on text extraction it is useful to check if performance is equal or better on the benchmark.
2932

3033

31-
## Testing and evaluating the code
34+
## Submitting changes
3235

33-
Here is how you can run the tests if you wish to correct the errors and further improve the code:
36+
Please send a pull request to Trafilatura with a list of what you have done (read more about [pull requests](http://help.github.com/pull-requests/)).
3437

35-
- Run `pytest` from trafilatura's directory, or select a particular test suite, for example `realworld_tests.py`, and run `pytest realworld_tests.py` or simply `python3 realworld_tests.py`
36-
- See also the [tests Readme](tests/README.rst) for information on the evaluation
38+
**Working on your first Pull Request?** See this tutorial: [How To Create a Pull Request on GitHub](https://www.digitalocean.com/community/tutorials/how-to-create-a-pull-request-on-github)
3739

3840

3941
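
The checks listed in the new "Testing and evaluating the code" section of CONTRIBUTING.md can be chained in a small helper script. The following is a minimal sketch and not part of the commit: the function name `run_checks` and its `runner` parameter are hypothetical, and it assumes that `pytest` and `mypy` are installed (for example via `pip install trafilatura[dev]`) and that it is called from the repository root.

```python
import subprocess
import sys

# Commands taken from the CONTRIBUTING.md instructions; running them through
# "python -m" is an assumption made here so that the active interpreter is used.
CHECKS = {
    "pytest": [sys.executable, "-m", "pytest"],
    "mypy": [sys.executable, "-m", "mypy", "trafilatura/"],
}


def run_checks(checks=CHECKS, runner=subprocess.run):
    """Run every check and return the names of those that failed.

    `runner` is a hypothetical hook that defaults to subprocess.run and
    can be swapped out to try the helper without executing anything.
    """
    failed = []
    for name, command in checks.items():
        completed = runner(command, check=False)
        if completed.returncode != 0:
            failed.append(name)
    return failed
```

Calling `run_checks()` returns an empty list when both tools pass, which corresponds to the condition stated for accepting pull requests.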

HISTORY.md

Lines changed: 10 additions & 2 deletions
@@ -1,7 +1,7 @@
11
## History / Changelog
22

33

4-
## future v2.0.0
4+
## 2.0.0
55

66
Breaking changes:
77
- Python 3.6 and 3.7 deprecated (#709)
@@ -12,6 +12,7 @@ Breaking changes:
1212
- downloads: remove `decode` argument in `fetch_url()` → use `fetch_response` instead (#724)
1313
- deprecated graphical user interface now removed (#713)
1414
- extraction: move `max_tree_size` parameter to `settings.cfg` (#742)
15+
- use type hinting (#721, #723, #748)
1516
- see [Python](https://trafilatura.readthedocs.io/en/latest/usage-python.html#deprecations) and [CLI](https://trafilatura.readthedocs.io/en/latest/usage-cli.html#deprecations) deprecations in the docs
1617

1718
Fixes:
@@ -20,11 +21,16 @@ Fixes:
2021
- more robust mapping for conversion to HTML (#721)
2122
- CLI downloads: use all information in settings file (#734)
2223
- downloads: cleaner urllib3 code (#736)
23-
- CLI: print URLs early for feeds and sitemaps with `--list` with @gremid (#744)
24+
- refine table markdown output by @unsleepy22 (#752)
25+
- extraction fix: images in text nodes by @unsleepy22 (#757)
2426

2527
Metadata:
2628
- more robust URL extraction (#710)
2729

30+
Command-line interface:
31+
- CLI: print URLs early for feeds and sitemaps with `--list` with @gremid (#744)
32+
- CLI: add 126 exit code for high error ratio (#747)
33+
2834
Maintenance:
2935
- remove already deprecated functions and args (#716)
3036
- add type hints (#723, #728)
@@ -33,10 +39,12 @@ Maintenance:
3339
- better debug messages in `main_extractor` (#714)
3440
- evaluation: review data, update packages, add magic_html (#731)
3541
- setup: explicit exports through `__all__` (#740)
42+
- tests: extend coverage (#753)
3643

3744
Documentation:
3845
- fix link in `docs/index.html` by @nzw0301 (#711)
3946
- remove docs from published packages (#743)
47+
- update docs (#745)
4048

4149

4250
## 1.12.2
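
The changelog entry on the new 126 exit code (#747) matters mostly to scripts that wrap the command-line tool. Below is a minimal sketch of how a caller might react to it. The helper name `run_and_classify` and the status labels are hypothetical; only the meaning of exit code 126 (high error ratio) comes from the changelog, which does not state the threshold that triggers it.

```python
import subprocess

# Exit code reported by the changelog for a high error ratio (#747).
HIGH_ERROR_RATIO = 126


def run_and_classify(command):
    """Run a command and map its exit code to a coarse status label.

    The labels are illustrative only and not part of Trafilatura.
    """
    completed = subprocess.run(command, check=False)
    if completed.returncode == 0:
        return "ok"
    if completed.returncode == HIGH_ERROR_RATIO:
        return "too many errors"
    return "failed"
```

The full command line is passed as a list of strings, that is the `trafilatura` executable followed by its usual arguments. Treating 126 separately lets a batch job distinguish a run with mostly failed downloads from other kinds of failure.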

README.md

Lines changed: 22 additions & 43 deletions
@@ -32,15 +32,16 @@ required, the output can be converted to commonly used formats.
3232

3333
Going from HTML bulk to essential parts can alleviate many problems
3434
related to text quality, by **focusing on the actual content**,
35-
**avoiding the noise** caused by recurring elements (headers, footers
36-
etc.), and **making sense of the data** with selected information. The
37-
extractor is designed to be **robust and reasonably fast**, it runs in
38-
production on millions of documents.
35+
**avoiding the noise** caused by recurring elements like headers and footers
36+
and by **making sense of the data and metadata** with selected information.
37+
The extractor strikes a balance between limiting noise (precision) and
38+
including all valid parts (recall). It is **robust and reasonably fast**.
3939

40-
The tool's versatility makes it **useful for quantitative and
41-
data-driven approaches**. It is used in the academic domain and beyond
42-
(e.g. in natural language processing, computational social science,
43-
search engine optimization, and information security).
40+
Trafilatura is [widely used](https://trafilatura.readthedocs.io/en/latest/used-by.html)
41+
and integrated into [thousands of projects](https://github.com/adbar/trafilatura/network/dependents)
42+
by companies like HuggingFace, IBM, and Microsoft Research as well as institutions like
43+
the Allen Institute, Stanford, the Tokyo Institute of Technology, and
44+
the University of Munich.
4445

4546

4647
### Features
@@ -85,22 +86,6 @@ For more information see the [benchmark section](https://trafilatura.readthedocs
8586
and the [evaluation readme](https://github.com/adbar/trafilatura/blob/master/tests/README.rst)
8687
to run the evaluation with the latest data and packages.
8788

88-
**750 documents, 2236 text & 2250 boilerplate segments (2022-05-18), Python 3.8**
89-
90-
| Python Package | Precision | Recall | Accuracy | F-Score | Diff. |
91-
|----------------|-----------|--------|----------|---------|-------|
92-
| html_text 0.5.2 | 0.529 | **0.958** | 0.554 | 0.682 | 2.2x |
93-
| inscriptis 2.2.0 (html to txt) | 0.534 | **0.959** | 0.563 | 0.686 | 3.5x |
94-
| newspaper3k 0.2.8 | 0.895 | 0.593 | 0.762 | 0.713 | 12x |
95-
| justext 3.0.0 (custom) | 0.865 | 0.650 | 0.775 | 0.742 | 5.2x |
96-
| boilerpy3 1.0.6 (article mode) | 0.814 | 0.744 | 0.787 | 0.777 | 4.1x |
97-
| *baseline (text markup)* | 0.757 | 0.827 | 0.781 | 0.790 | **1x** |
98-
| goose3 3.1.9 | **0.934** | 0.690 | 0.821 | 0.793 | 22x |
99-
| readability-lxml 0.8.1 | 0.891 | 0.729 | 0.820 | 0.801 | 5.8x |
100-
| news-please 1.5.22 | 0.898 | 0.734 | 0.826 | 0.808 | 61x |
101-
| readabilipy 0.2.0 | 0.877 | 0.870 | 0.874 | 0.874 | 248x |
102-
| trafilatura 1.2.2 (standard) | 0.914 | 0.904 | **0.910** | **0.909** | 7.1x |
103-
10489

10590
#### Other evaluations:
10691

@@ -138,7 +123,7 @@ This package is distributed under the [Apache 2.0 license](https://www.apache.or
138123
Versions prior to v1.8.0 are under GPLv3+ license.
139124

140125

141-
## Contributing
126+
### Contributing
142127

143128
Contributions of all kinds are welcome. Visit the [Contributing
144129
page](https://github.com/adbar/trafilatura/blob/master/CONTRIBUTING.md)
@@ -152,13 +137,17 @@ who extended the docs or submitted bug reports, features and bugfixes!
152137

153138
## Context
154139

155-
Developed with practical applications of academic research in mind, this
156-
software is part of a broader effort to derive information from web
157-
documents. Extracting and pre-processing web texts to the exacting
158-
standards of scientific research presents a substantial challenge. This
159-
software package simplifies text data collection and enhances corpus
160-
quality, it is currently used to build [text databases for linguistic
161-
research](https://www.dwds.de/d/k-web).
140+
This work started as a PhD project at the crossroads of linguistics and
141+
NLP; this expertise has been instrumental in shaping Trafilatura over
142+
the years. Initially launched to create text databases for research purposes
143+
at the Berlin-Brandenburg Academy of Sciences (DWDS and ZDL units),
144+
this package continues to be maintained but its future development
145+
depends on community support.
146+
147+
**If you value this software or depend on it for your product, consider
148+
sponsoring it and contributing to its codebase**. Your support will
149+
help maintain and enhance this popular package, ensuring its growth,
150+
robustness, and accessibility for developers and users around the world.
162151

163152
*Trafilatura* is an Italian word for [wire
164153
drawing](https://en.wikipedia.org/wiki/Wire_drawing) symbolizing the
@@ -171,11 +160,6 @@ Reach out via the software repository or the [contact
171160
page](https://adrien.barbaresi.eu/) for inquiries, collaborations, or
172161
feedback. See also social networks for the latest updates.
173162

174-
This work started as a PhD project at the crossroads of linguistics and
175-
NLP, this expertise has been instrumental in shaping Trafilatura over
176-
the years. It has first been released under its current form in 2019,
177-
its development is referenced in the following publications:
178-
179163
- Barbaresi, A. [Trafilatura: A Web Scraping Library and Command-Line
180164
Tool for Text Discovery and
181165
Extraction](https://aclanthology.org/2021.acl-demo.15/), Proceedings
@@ -212,18 +196,13 @@ acquisition. Here is how to cite it:
212196

213197
### Software ecosystem
214198

215-
Case studies and publications are listed on the [Used By documentation
216-
page](https://trafilatura.readthedocs.io/en/latest/used-by.html).
217-
218199
Jointly developed plugins and additional packages also contribute to the
219200
field of web data extraction and analysis:
220201

221202
<img alt="Software ecosystem" src="https://raw.githubusercontent.com/adbar/htmldate/master/docs/software-ecosystem.png" align="center" width="65%"/>
222203

223204
Corresponding posts can be found on [Bits of
224-
Language](https://adrien.barbaresi.eu/blog/tag/trafilatura.html). The
225-
blog covers a range of topics from technical how-tos, updates on new
226-
features, to discussions on text mining challenges and solutions.
205+
Language](https://adrien.barbaresi.eu/blog/tag/trafilatura.html).
227206

228207
Impressive, you have reached the end of the page: Thank you for your
229208
interest!

docs/index.rst

Lines changed: 25 additions & 18 deletions
@@ -40,9 +40,9 @@ Description
4040

4141
Trafilatura is a **Python package and command-line tool** designed to gather text on the Web. It includes discovery, extraction and text processing components. Its main applications are **web crawling, downloads, scraping, and extraction** of main texts, metadata and comments. It aims at staying **handy and modular**: no database is required, the output can be converted to commonly used formats.
4242

43-
Going from raw HTML to essential parts can alleviate many problems related to text quality, first by avoiding the **noise caused by recurring elements** (headers, footers, links/blogroll etc.) and second by including information such as author and date in order to **make sense of the data**. The extractor tries to strike a balance between limiting noise (precision) and including all valid parts (recall). It also has to be **robust and reasonably fast**, it runs in production on millions of documents.
43+
Going from raw HTML to essential parts can alleviate many problems related to text quality, by avoiding the **noise caused by recurring elements** like headers and footers and by **making sense of the data and metadata** with selected information. The extractor strikes a balance between limiting noise (precision) and including all valid parts (recall). It is **robust and reasonably fast**.
4444

45-
This tool can be **useful for quantitative research** in corpus linguistics, natural language processing, computational social science and beyond: it is relevant to anyone interested in data science, information extraction, text mining, and scraping-intensive use cases like search engine optimization, business analytics or information security.
45+
Trafilatura is `widely used <used-by.html>`_ and integrated into `thousands of projects <https://github.com/adbar/trafilatura/network/dependents>`_ by companies like HuggingFace, IBM, and Microsoft Research as well as institutions like the Allen Institute, Stanford, the Tokyo Institute of Technology, and the University of Munich.
4646

4747

4848
Features
@@ -120,25 +120,27 @@ Versions prior to v1.8.0 are under GPLv3+ license.
120120

121121

122122
Contributing
123-
------------
123+
~~~~~~~~~~~~
124124

125125
Contributions of all kinds are welcome. Visit the `Contributing page <https://github.com/adbar/trafilatura/blob/master/CONTRIBUTING.md>`_ for more information. Bug reports can be filed on the `dedicated issue page <https://github.com/adbar/trafilatura/issues>`_.
126126

127127
Many thanks to the `contributors <https://github.com/adbar/trafilatura/graphs/contributors>`_ who extended the docs or submitted bug reports, features and bugfixes!
128128

129129

130-
Changes
131-
-------
132-
133-
For version history and changes see the `changelog <https://github.com/adbar/trafilatura/blob/master/HISTORY.md>`_.
134-
135-
136130
Context
137131
-------
138132

139-
Originally released to collect data for linguistic research and lexicography at the `Berlin-Brandenburg Academy of Sciences <https://www.dwds.de/d/k-web>`_, Trafilatura is now `widely used <used-by.html>`_.
133+
This work started as a PhD project at the crossroads of linguistics and NLP;
134+
this expertise has been instrumental in shaping Trafilatura over the years.
135+
Initially launched to create text databases for research purposes
136+
at the Berlin-Brandenburg Academy of Sciences (DWDS and ZDL units),
137+
this package continues to be maintained but its future development
138+
depends on community support.
140139

141-
Extracting and pre-processing web texts to the exacting standards of scientific research presents a substantial challenge. These documentation pages also provide information on `concepts behind data collection <background.html>`_ as well as `tutorials <tutorials.html>`_ on how to gather web texts.
140+
**If you value this software or depend on it for your product, consider
141+
sponsoring it and contributing to its codebase**. Your support will
142+
help maintain and enhance this popular package, ensuring its growth,
143+
robustness, and accessibility for developers and users around the world.
142144

143145
*Trafilatura* is an Italian word for `wire drawing <https://en.wikipedia.org/wiki/Wire_drawing>`_ symbolizing the refinement and conversion process. It is also the way shapes of pasta are formed.
144146

@@ -148,9 +150,6 @@ Author
148150

149151
Reach out via the software repository or the `contact page <https://adrien.barbaresi.eu/>`_ for inquiries, collaborations, or feedback. See also social networks for the latest updates.
150152

151-
This work started as a PhD project at the crossroads of linguistics and NLP, this expertise has been instrumental in shaping Trafilatura over the years. It has first been released under its current form in 2019, its development is referenced in the following publications:
152-
153-
154153
- Barbaresi, A. `Trafilatura: A Web Scraping Library and Command-Line Tool for Text Discovery and Extraction <https://aclanthology.org/2021.acl-demo.15/>`_, Proceedings of ACL/IJCNLP 2021: System Demonstrations, 2021, p. 122-131.
155154
- Barbaresi, A. "`Generic Web Content Extraction with Open-Source Software <https://hal.archives-ouvertes.fr/hal-02447264/document>`_", Proceedings of KONVENS 2019, Kaleidoscope Abstracts, 2019.
156155
- Barbaresi, A. "`Efficient construction of metadata-enhanced web corpora <https://hal.archives-ouvertes.fr/hal-01371704v2/document>`_", Proceedings of the `10th Web as Corpus Workshop (WAC-X) <https://www.sigwac.org.uk/wiki/WAC-X>`_, 2016.
@@ -186,16 +185,17 @@ Trafilatura is widely used in the academic domain, chiefly for data acquisition.
186185
Software ecosystem
187186
~~~~~~~~~~~~~~~~~~
188187

189-
Case studies and publications are listed on the `Used By documentation page <used-by.html>`_.
190-
191188
Jointly developed plugins and additional packages also contribute to the field of web data extraction and analysis:
192189

193190
.. image:: software-ecosystem.png
194191
:alt: Software ecosystem
195192
:align: center
196193
:width: 65%
197194

198-
Corresponding posts on `Bits of Language <https://adrien.barbaresi.eu/blog/tag/trafilatura.html>`_ (blog).
195+
Corresponding posts can be found on
196+
`Bits of Language <https://adrien.barbaresi.eu/blog/tag/trafilatura.html>`_.
197+
The blog covers a range of topics from technical how-tos, updates on new
198+
features, to discussions on text mining challenges and solutions.
199199

200200

201201
Building the docs
@@ -208,6 +208,13 @@ Starting from the ``docs/`` folder of the repository:
208208

209209

210210

211+
Changes
212+
-------
213+
214+
For version history and changes see the `changelog <https://github.com/adbar/trafilatura/blob/master/HISTORY.md>`_.
215+
216+
217+
211218
Further documentation
212219
=====================
213220

@@ -222,4 +229,4 @@ Further documentation
222229
used-by
223230
background
224231

225-
* :ref:`genindex`
232+
:ref:`genindex`
