* prepare version 2.0.0
* update setup and wording
* docs: readme and structure
* update dependabot and funding
* update contributing and history files
CONTRIBUTING.md
+14 -12 (14 additions & 12 deletions)
@@ -1,39 +1,41 @@
 ## How to contribute
 
-Thank you for considering contributing to Trafilatura! Your contributions make the software and its documentation better.
+Your contributions make the software and its documentation better. A special thanks to all the [contributors](https://github.com/adbar/trafilatura/graphs/contributors) who have played a part in Trafilatura.
 
 
 There are many ways to contribute, you could:
 
 * Improve the documentation: Write tutorials and guides, correct mistakes, or translate existing content.
-* Find bugs and submit bug reports: Help making Trafilatura a robust and versatile tool.
+* Find bugs and submit bug reports: Help making Trafilatura an even more robust tool.
 * Submit feature requests: Share your feedback and suggestions.
 * Write code: Fix bugs or add new features.
 
 
 Here are some important resources:
 
 * [List of currently open issues](https://github.com/adbar/trafilatura/issues) (no pretention to exhaustivity!)
-* [Roadmap and milestones](https://github.com/adbar/trafilatura/milestones)
-* [How to Contribute to Open Source](https://opensource.guide/how-to-contribute/)
+* [How to contribute to open source](https://opensource.guide/how-to-contribute/)
 
 
-## Submitting changes
+## Testing and evaluating the code
 
-Please send a [GitHub Pull Request to trafilatura](https://github.com/adbar/trafilatura/pull/new/master) with a clear list of what you have done (read more about [pull requests](http://help.github.com/pull-requests/)).
+Here is how you can run the tests and code quality checks:
 
-**Working on your first Pull Request?** See this tutorial: [How To Create a Pull Request on GitHub](https://www.digitalocean.com/community/tutorials/how-to-create-a-pull-request-on-github)
+- Install the necessary packages with `pip install trafilatura[dev]`
+- Run `pytest` from trafilatura's directory, or select a particular test suite, for example `realworld_tests.py`, and run `pytest realworld_tests.py` or simply `python3 realworld_tests.py`
+- Run `mypy` on the directory: `mypy trafilatura/`
+- See also the [tests Readme](tests/README.rst) for information on the evaluation benchmark
 
+Pull requests will only be accepted if they there are no errors in pytest and mypy.
 
-A special thanks to all the [contributors](https://github.com/adbar/trafilatura/graphs/contributors) who have played a part in Trafilatura.
+If you work on text extraction it is useful to check if performance is equal or better on the benchmark.
 
 
-## Testing and evaluating the code
+## Submitting changes
 
-Here is how you can run the tests if you wish to correct the errors and further improve the code:
+Please send a pull request to Trafilatura with a list of what you have done (read more about [pull requests](http://help.github.com/pull-requests/)).
 
-- Run `pytest` from trafilatura's directory, or select a particular test suite, for example `realworld_tests.py`, and run `pytest realworld_tests.py` or simply `python3 realworld_tests.py`
-- See also the [tests Readme](tests/README.rst) for information on the evaluation
+**Working on your first Pull Request?** See this tutorial: [How To Create a Pull Request on GitHub](https://www.digitalocean.com/community/tutorials/how-to-create-a-pull-request-on-github)
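The testing steps added to CONTRIBUTING.md (run `pytest`, then `mypy trafilatura/`) could be wrapped in a small helper so that both checks are run before a pull request is opened. The following is only a sketch and not part of the commit: the names `CHECKS` and `run_checks` are hypothetical, and calling the tools through `python -m` is an assumption made here so that the current interpreter's installation is used. It presumes the dev dependencies are already installed with `pip install trafilatura[dev]` and that it is run from the repository root.

```python
import subprocess
import sys
from typing import Callable, List, Sequence

# The two checks named in CONTRIBUTING.md. Invoking them with
# `python -m` is an assumption of this sketch, not project policy.
CHECKS: List[List[str]] = [
    [sys.executable, "-m", "pytest"],
    [sys.executable, "-m", "mypy", "trafilatura/"],
]


def run_checks(
    checks: Sequence[Sequence[str]] = CHECKS,
    runner: Callable[[List[str]], int] = subprocess.call,
) -> bool:
    """Run every check and return True only if all of them exit with 0."""
    # Every check is executed, even after a failure, so that one run
    # reports both the pytest and the mypy results.
    exit_codes = [runner(list(cmd)) for cmd in checks]
    return all(code == 0 for code in exit_codes)
```

Calling `run_checks()` from the repository root runs both tools in sequence; the `runner` parameter exists only so the helper can be exercised without launching the real commands.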
HISTORY.md
+10 -2 (10 additions & 2 deletions)
@@ -1,7 +1,7 @@
 ## History / Changelog
 
 
-## future v2.0.0
+## 2.0.0
 
 Breaking changes:
 - Python 3.6 and 3.7 deprecated (#709)
@@ -12,6 +12,7 @@ Breaking changes:
 - downloads: remove `decode` argument in `fetch_url()` → use `fetch_response` instead (#724)
 - deprecated graphical user interface now removed (#713)
 - extraction: move `max_tree_size` parameter to `settings.cfg` (#742)
+- use type hinting (#721, #723, #748)
 - see [Python](https://trafilatura.readthedocs.io/en/latest/usage-python.html#deprecations) and [CLI](https://trafilatura.readthedocs.io/en/latest/usage-cli.html#deprecations) deprecations in the docs
 
 Fixes:
@@ -20,11 +21,16 @@ Fixes:
 - more robust mapping for conversion to HTML (#721)
 - CLI downloads: use all information in settings file (#734)
 - downloads: cleaner urllib3 code (#736)
-- CLI: print URLs early for feeds and sitemaps with `--list` with @gremid (#744)
+- refine table markdown output by @unsleepy22 (#752)
+- extraction fix: images in text nodes by @unsleepy22 (#757)
 
 Metadata:
 - more robust URL extraction (#710)
 
+Command-line interface:
+- CLI: print URLs early for feeds and sitemaps with `--list` with @gremid (#744)
+- CLI: add 126 exit code for high error ratio (#747)
+
 Maintenance:
 - remove already deprecated functions and args (#716)
 - add type hints (#723, #728)
@@ -33,10 +39,12 @@ Maintenance:
 - better debug messages in `main_extractor` (#714)
docs/index.rst
+25 -18 (25 additions & 18 deletions)
@@ -40,9 +40,9 @@ Description
 
 Trafilatura is a **Python package and command-line tool** designed to gather text on the Web. It includes discovery, extraction and text processing components. Its main applications are **web crawling, downloads, scraping, and extraction** of main texts, metadata and comments. It aims at staying **handy and modular**: no database is required, the output can be converted to commonly used formats.
 
-Going from raw HTML to essential parts can alleviate many problems related to text quality, first by avoiding the **noise caused by recurring elements** (headers, footers, links/blogroll etc.) and second by including information such as author and date in order to **make sense of the data**. The extractor tries to strike a balance between limiting noise (precision) and including all valid parts (recall). It also has to be **robust and reasonably fast**, it runs in production on millions of documents.
+Going from raw HTML to essential parts can alleviate many problems related to text quality, by avoiding the **noise caused by recurring elements** like headers and footers and by **making sense of the data and metadata** with selected information. The extractor strikes a balance between limiting noise (precision) and including all valid parts (recall). It is **robust and reasonably fast**.
 
-This tool can be **useful for quantitative research** in corpus linguistics, natural language processing, computational social science and beyond: it is relevant to anyone interested in data science, information extraction, text mining, and scraping-intensive use cases like search engine optimization, business analytics or information security.
+Trafilatura is `widely used <used-by.html>`_ and integrated into `thousands of projects <https://github.com/adbar/trafilatura/network/dependents>`_ by companies like HuggingFace, IBM, and Microsoft Research as well as institutions like the Allen Institute, Stanford, the Tokyo Institute of Technology, and the University of Munich.
 
 
 Features
@@ -120,25 +120,27 @@ Versions prior to v1.8.0 are under GPLv3+ license.
 
 
 Contributing
-------------
+~~~~~~~~~~~~
 
 Contributions of all kinds are welcome. Visit the `Contributing page <https://github.com/adbar/trafilatura/blob/master/CONTRIBUTING.md>`_ for more information. Bug reports can be filed on the `dedicated issue page <https://github.com/adbar/trafilatura/issues>`_.
 
 Many thanks to the `contributors <https://github.com/adbar/trafilatura/graphs/contributors>`_ who extended the docs or submitted bug reports, features and bugfixes!
 
 
-Changes
--------
-
-For version history and changes see the `changelog <https://github.com/adbar/trafilatura/blob/master/HISTORY.md>`_.
-
-
 Context
 -------
 
-Originally released to collect data for linguistic research and lexicography at the `Berlin-Brandenburg Academy of Sciences <https://www.dwds.de/d/k-web>`_, Trafilatura is now `widely used <used-by.html>`_.
+This work started as a PhD project at the crossroads of linguistics and NLP,
+this expertise has been instrumental in shaping Trafilatura over the years.
+Initially launched to create text databases for research purposes
+at the Berlin-Brandenburg Academy of Sciences (DWDS and ZDL units),
+this package continues to be maintained but its future development
+depends on community support.
 
-Extracting and pre-processing web texts to the exacting standards of scientific research presents a substantial challenge. These documentation pages also provide information on `concepts behind data collection <background.html>`_ as well as `tutorials <tutorials.html>`_ on how to gather web texts.
+**If you value this software or depend on it for your product, consider
+sponsoring it and contributing to its codebase**. Your support will
+help maintain and enhance this popular package, ensuring its growth,
+robustness, and accessibility for developers and users around the world.
 
 *Trafilatura* is an Italian word for `wire drawing <https://en.wikipedia.org/wiki/Wire_drawing>`_ symbolizing the refinement and conversion process. It is also the way shapes of pasta are formed.
 
@@ -148,9 +150,6 @@ Author
 
 Reach out via the software repository or the `contact page <https://adrien.barbaresi.eu/>`_ for inquiries, collaborations, or feedback. See also social networks for the latest updates.
 
-This work started as a PhD project at the crossroads of linguistics and NLP, this expertise has been instrumental in shaping Trafilatura over the years. It has first been released under its current form in 2019, its development is referenced in the following publications:
-
-
 - Barbaresi, A. `Trafilatura: A Web Scraping Library and Command-Line Tool for Text Discovery and Extraction <https://aclanthology.org/2021.acl-demo.15/>`_, Proceedings of ACL/IJCNLP 2021: System Demonstrations, 2021, p. 122-131.
 - Barbaresi, A. "`Generic Web Content Extraction with Open-Source Software <https://hal.archives-ouvertes.fr/hal-02447264/document>`_", Proceedings of KONVENS 2019, Kaleidoscope Abstracts, 2019.
 - Barbaresi, A. "`Efficient construction of metadata-enhanced web corpora <https://hal.archives-ouvertes.fr/hal-01371704v2/document>`_", Proceedings of the `10th Web as Corpus Workshop (WAC-X) <https://www.sigwac.org.uk/wiki/WAC-X>`_, 2016.
@@ -186,16 +185,17 @@ Trafilatura is widely used in the academic domain, chiefly for data acquisition.
 Software ecosystem
 ~~~~~~~~~~~~~~~~~~
 
-Case studies and publications are listed on the `Used By documentation page <used-by.html>`_.
-
 Jointly developed plugins and additional packages also contribute to the field of web data extraction and analysis:
 
 .. image:: software-ecosystem.png
     :alt: Software ecosystem
     :align: center
     :width: 65%
 
-Corresponding posts on `Bits of Language <https://adrien.barbaresi.eu/blog/tag/trafilatura.html>`_ (blog).
+Corresponding posts can be found on
+`Bits of Language <https://adrien.barbaresi.eu/blog/tag/trafilatura.html>`_.
+The blog covers a range of topics from technical how-tos, updates on new
+features, to discussions on text mining challenges and solutions.
 
 
 Building the docs
@@ -208,6 +208,13 @@ Starting from the ``docs/`` folder of the repository:
 
 
 
+Changes
+-------
+
+For version history and changes see the `changelog <https://github.com/adbar/trafilatura/blob/master/HISTORY.md>`_.