Change CollectedLinks to store project page URLs #7073

cjerdonek · 2019-09-24T07:37:00Z

This is a refactoring PR that changes CollectedLinks to store the project page URLs instead of the mapping of "project page URL" to "list of links". In particular, LinkCollector.collect_links() will no longer do any fetching of pages. Instead, the network / page-fetching code is moved to LinkCollector.fetch_page() (to be called by PackageFinder).

These changes are useful for two main reasons:

It decouples the HTML-parsing code from LinkCollector, so LinkCollector no longer has to know about HTML parsing. Instead, it encapsulates just the network aspects, along with knowing about and processing the relevant CLI options like --find-links. More generally, this will help isolate the network code from the package-finding logic.
It will make it easier for us to filter out (aka evaluate) links while we're parsing the HTML, instead of storing the full list of links and then making another pass over them later to pick out which ones are meaningful. This approach is useful e.g. if the full list of links is gigantic, but only a small fraction of them correspond to project links.
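
To illustrate the second point, here is a minimal sketch of evaluating links while the HTML is being parsed, so that rejected links are never stored. It uses only the standard library's html.parser; the names FilteringLinkParser, parse_and_filter, and the keep predicate are hypothetical and are not part of pip.

```python
from html.parser import HTMLParser
from typing import Callable, List


class FilteringLinkParser(HTMLParser):
    """Hypothetical helper: keep only the anchor hrefs accepted by a predicate."""

    def __init__(self, keep: Callable[[str], bool]) -> None:
        super().__init__()
        self.keep = keep
        self.kept: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        # Evaluate each link as soon as it is seen; rejected links are dropped
        # immediately instead of being stored for a later pass.
        if href is not None and self.keep(href):
            self.kept.append(href)


def parse_and_filter(html: str, keep: Callable[[str], bool]) -> List[str]:
    parser = FilteringLinkParser(keep)
    parser.feed(html)
    parser.close()
    return parser.kept


# A made-up index page mixing links for several projects.
page = """
<a href="foo-1.0.tar.gz">foo-1.0.tar.gz</a>
<a href="bar-2.0.tar.gz">bar-2.0.tar.gz</a>
<a href="foo-1.1-py3-none-any.whl">foo-1.1</a>
"""
links = parse_and_filter(page, lambda href: href.startswith("foo-"))
```

With a very large page, only the links that pass the predicate are ever held in memory, which is the benefit the description points to.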

This PR also simplifies PackageFinder.find_all_candidates() by refactoring out a PackageFinder.process_project_url() method, which is responsible for calling LinkCollector.fetch_page() and then doing the HTML-parsing and link evaluation.
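
The division of responsibilities described above can be sketched roughly as follows. This is a simplified illustration, not pip's actual code: the class and method names come from this description, but the constructor arguments, the injected fetch, parse_links, and evaluate callables, and the URL construction are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class CollectedLinks:
    # Stores project page URLs rather than a "URL -> list of links" mapping.
    project_urls: List[str]


class LinkCollector:
    def __init__(self, index_urls: List[str],
                 fetch: Callable[[str], Optional[str]]) -> None:
        self.index_urls = index_urls
        self._fetch = fetch  # assumed stand-in for the network layer

    def collect_links(self, project_name: str) -> CollectedLinks:
        # Only computes where to look; no pages are fetched here.
        urls = [f"{u.rstrip('/')}/{project_name}/" for u in self.index_urls]
        return CollectedLinks(project_urls=urls)

    def fetch_page(self, url: str) -> Optional[str]:
        return self._fetch(url)


class PackageFinder:
    def __init__(self, collector: LinkCollector,
                 parse_links: Callable[[str], List[str]],
                 evaluate: Callable[[str], bool]) -> None:
        self.collector = collector
        self.parse_links = parse_links  # assumed parsing hook
        self.evaluate = evaluate        # assumed link-evaluation hook

    def process_project_url(self, url: str) -> List[str]:
        # Fetch one project page, then parse and evaluate its links.
        html = self.collector.fetch_page(url)
        if html is None:
            return []
        return [link for link in self.parse_links(html) if self.evaluate(link)]

    def find_all_candidates(self, project_name: str) -> List[str]:
        collected = self.collector.collect_links(project_name)
        candidates: List[str] = []
        for url in collected.project_urls:
            candidates.extend(self.process_project_url(url))
        return candidates


# Usage with a fake fetcher that records which URLs were requested.
fetched: List[str] = []
pages = {
    "https://example.invalid/simple/foo/":
        "foo-1.0.tar.gz other-0.1.tar.gz foo-1.1.tar.gz",
}


def fake_fetch(url: str) -> Optional[str]:
    fetched.append(url)
    return pages.get(url)


collector = LinkCollector(["https://example.invalid/simple"], fake_fetch)
finder = PackageFinder(collector, parse_links=str.split,
                       evaluate=lambda link: link.startswith("foo-"))

collected = collector.collect_links("foo")
fetched_during_collect = list(fetched)  # stays empty: collecting does no I/O
candidates = finder.find_all_candidates("foo")
```

Injecting the fetcher keeps the sketch testable without network access and makes the boundary visible: only fetch_page() touches the network, and only PackageFinder deals with page contents.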

BrownTruck · 2019-10-20T13:45:02Z

Hello!

I am an automated bot and I have noticed that this pull request is not currently able to be merged. If you are able to either merge the master branch into this pull request or rebase this pull request against master then it will be eligible for code review and hopefully merging!

chrahunt · 2019-11-11T02:28:13Z

Rebased on master.

cjerdonek added the labels "type: refactor" (Refactoring code), "skip news" (Does not need a NEWS file entry, e.g. trivial changes), and "C: finder" (PackageFinder and index related code) Sep 24, 2019

cjerdonek force-pushed the remove-get-pages branch from 90f00de to 7985d6a September 24, 2019 07:48

pradyunsg approved these changes Sep 24, 2019

BrownTruck added the "needs rebase or merge" label (PR has conflicts with current master) Oct 20, 2019

cjerdonek added 4 commits November 10, 2019 21:19

Add LinkCollector.fetch_page().

024038c

Change CollectedLinks to store project_urls.

bab1e4f

Add PackageFinder.process_project_url(), and test.

f4cad3d

Update the PackageFinder architecture document.

9bd0db2

chrahunt force-pushed the remove-get-pages branch from 7985d6a to 9bd0db2 November 11, 2019 02:27

pypa-bot removed the "needs rebase or merge" label (PR has conflicts with current master) Nov 11, 2019

chrahunt approved these changes Nov 11, 2019

chrahunt merged commit 8453fa5 into pypa:master Nov 11, 2019

lock bot added the "auto-locked" label (Outdated issues that have been locked by automation) Dec 11, 2019

lock bot locked as resolved and limited conversation to collaborators Dec 11, 2019
