@@ -316,20 +316,15 @@ You can download this application from:

[https://www.py4e.com/code3/gmane.zip](https://www.py4e.com/code3/gmane.zip)

- We will be using data from a free email list archiving service called
- [http://www.gmane.org](http://www.gmane.org). This service is very popular with open
- source projects because it provides a nice searchable archive of their
- email activity. They also have a very liberal policy regarding accessing
- their data through their API. They have no rate limits, but ask that you
- don't overload their service and take only the data you need. You can
- read gmane's terms and conditions at this page:
+ We will be using data from a free email list archiving service that was called
+ *gmane* - the service has since been shut down and for the purposes of this
+ course, a partial archive has been kept
+ at [http://mbox.dr-chuck.net](http://mbox.dr-chuck.net).
+ The gmane service was very popular with open
+ source projects because it provided a nice searchable archive of their
+ email activity.

- [http://www.gmane.org/export.php](http://www.gmane.org/export.php)
-
- *It is very important that you make use of the gmane.org data
- responsibly by adding delays to your access of their services and
- spreading long-running jobs over a longer period of time. Do not abuse
- this free service and ruin it for the rest of us.*
+ [http://mbox.dr-chuck.net/export.php](http://mbox.dr-chuck.net/export.php)

When the Sakai email data was spidered using this software, it produced
nearly a Gigabyte of data and took a number of runs on several days. The
@@ -340,15 +335,15 @@ corpus so you don't have to spider for five days just to run the
programs. If you download the pre-spidered content, you should still run
the spidering process to catch up with more recent messages.

- The first step is to spider the gmane repository. The base URL is
+ The first step is to spider the repository. The base URL is
hard-coded in the *gmane.py* and is hard-coded to the
Sakai developer list. You can spider another repository by changing that
base url. Make sure to delete the *content.sqlite* file
if you switch the base url.

The *gmane.py* file operates as a responsible caching
spider in that it runs slowly and retrieves one mail message per second
- so as to avoid getting throttled by gmane. It stores all of its data in
+ so as to avoid getting throttled. It stores all of its data in
a database and can be interrupted and restarted as often as needed. It
may take many hours to pull all the data down. So you may need to
restart several times.
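
The caching behavior described above (one request per second, every message stored in a database, restartable at any point) can be sketched roughly as follows. This is not the actual *gmane.py*: the `Messages` table, the `spider` and `fetch_url` helpers, and the `first_id` parameter are hypothetical names chosen for illustration, and the base URL is assumed from the sample output. Error handling and format checks are left out.

```python
import sqlite3
import time
import urllib.request

# Assumed base URL, taken from the sample output; gmane.py hard-codes its own.
BASE_URL = "http://mbox.dr-chuck.net/sakai.devel/"


def fetch_url(url):
    """Download one page and return its text (needs network access)."""
    with urllib.request.urlopen(url, timeout=30) as response:
        return response.read().decode("utf-8", errors="replace")


def spider(conn, how_many, fetch=fetch_url, delay=1.0, first_id=1):
    """Fetch how_many messages, resuming after the highest id already stored.

    Returns the id the next run would start from.
    """
    cur = conn.cursor()
    # Hypothetical schema: one row per message, keyed by message id.
    cur.execute(
        "CREATE TABLE IF NOT EXISTS Messages (id INTEGER PRIMARY KEY, body TEXT)"
    )
    # Resume point: a restart picks up after the last stored message.
    cur.execute("SELECT MAX(id) FROM Messages")
    last = cur.fetchone()[0]
    next_id = first_id if last is None else last + 1

    for _ in range(how_many):
        text = fetch(f"{BASE_URL}{next_id}/{next_id + 1}")
        cur.execute(
            "INSERT INTO Messages (id, body) VALUES (?, ?)", (next_id, text)
        )
        conn.commit()      # commit per message so an interruption loses nothing
        next_id += 1
        time.sleep(delay)  # pace requests to avoid being throttled
    return next_id
```

Under these assumptions, calling `spider(sqlite3.connect("content.sqlite"), 10)` repeatedly would continue from wherever the previous run stopped. Committing after every message is what makes interruption safe, at the cost of slower writes.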
@@ -358,17 +353,17 @@ messages of the Sakai developer list:

~~~~
How many messages:10
- http://download.gmane.org/gmane.comp.cms.sakai.devel/51410/51411 9460
+ http://mbox.dr-chuck.net/sakai.devel/51410/51411 9460
[email protected] 2013-04-05 re: [building ...
- http://download.gmane.org/gmane.comp.cms.sakai.devel/51411/51412 3379
+ http://mbox.dr-chuck.net/sakai.devel/51411/51412 3379
[email protected] 2013-04-06 re: [building ...
- http://download.gmane.org/gmane.comp.cms.sakai.devel/51412/51413 9903
+ http://mbox.dr-chuck.net/sakai.devel/51412/51413 9903
[email protected] 2013-04-05 [building sakai] melete 2.9 oracle ...
- http://download.gmane.org/gmane.comp.cms.sakai.devel/51413/51414 349265
+ http://mbox.dr-chuck.net/sakai.devel/51413/51414 349265
[email protected] 2013-04-07 [building sakai] ...
- http://download.gmane.org/gmane.comp.cms.sakai.devel/51414/51415 3481
+ http://mbox.dr-chuck.net/sakai.devel/51414/51415 3481
[email protected] 2013-04-07 re: ...
- http://download.gmane.org/gmane.comp.cms.sakai.devel/51415/51416 0
+ http://mbox.dr-chuck.net/sakai.devel/51415/51416 0

Does not start with From
~~~~
@@ -379,7 +374,7 @@ message. It continues spidering until it has spidered the desired number
of messages or it reaches a page that does not appear to be a properly
formatted message.
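
The two stopping conditions can be illustrated with a small sketch. It assumes that a properly formatted message is one whose text begins with `From ` (suggested by the "Does not start with From" line in the sample output); the `collect_messages` helper is a hypothetical name and is not taken from *gmane.py*.

```python
def collect_messages(pages, limit):
    """Keep pages until `limit` is reached or a page is not a message."""
    kept = []
    for text in pages:
        if len(kept) >= limit:
            break  # the desired number of messages has been collected
        # Assumption: a well-formed message starts with "From ".
        if not text.startswith("From "):
            print("Does not start with From")
            break  # stop at the first page that is not a message
        kept.append(text)
    return kept
```

Note that a single malformed or missing page ends the run even if valid messages follow it, which is why a gap in the archive stops the spider.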

- Sometimes __gmane.org__ is missing a message. Perhaps
+ Sometimes the repository is missing a message. Perhaps
administrators can delete messages or perhaps they get lost. If your
spider stops, and it seems it has hit a missing message, go into the
SQLite Manager and add a row with the missing id leaving all the other