Skip to content

Change CollectedLinks to store project page URLs #7073

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 4 commits into from
Nov 11, 2019

Conversation

cjerdonek
Copy link
Member

This is a refactoring PR that changes CollectedLinks to store the project page URLs instead of the mapping of "project page URL" to "list of links". In particular, LinkCollector.collect_links() will no longer do any fetching of pages. Instead, the network / page-fetching code is moved to LinkCollector.fetch_page() (to be called by PackageFinder).

These changes are useful for two main reasons:

  1. It decouples the HTML-parsing code from LinkCollector, so LinkCollector no longer has to know about HTML-parsing. Instead, it more encapsulates just the network aspects, combined with knowing about and processing the relevant CLI options like --find-links. This will help more generally in isolating out the network code from the package-finding logic.

  2. It will make it easier for us to filter out (aka evaluate) links while we're parsing the HTML, instead of storing the full list of links, only later to do another pass over them later to pick out which ones are meaningful. This approach is useful e.g. if the full list of links is gigantic, but only a small fraction of them correspond to project links.

This PR also simplifies PackageFinder.find_all_candidates() by refactoring out a PackageFinder.process_project_url() method, which is responsible for calling LinkCollector.fetch_page() and then doing the HTML-parsing and link evaluation.

@cjerdonek cjerdonek added type: refactor Refactoring code skip news Does not need a NEWS file entry (eg: trivial changes) C: finder PackageFinder and index related code labels Sep 24, 2019
@BrownTruck
Copy link
Contributor

Hello!

I am an automated bot and I have noticed that this pull request is not currently able to be merged. If you are able to either merge the master branch into this pull request or rebase this pull request against master then it will be eligible for code review and hopefully merging!

@BrownTruck BrownTruck added the needs rebase or merge PR has conflicts with current master label Oct 20, 2019
@pypa-bot pypa-bot removed the needs rebase or merge PR has conflicts with current master label Nov 11, 2019
@chrahunt
Copy link
Member

Rebased on master.

@chrahunt chrahunt merged commit 8453fa5 into pypa:master Nov 11, 2019
@lock lock bot added the auto-locked Outdated issues that have been locked by automation label Dec 11, 2019
@lock lock bot locked as resolved and limited conversation to collaborators Dec 11, 2019
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
auto-locked Outdated issues that have been locked by automation C: finder PackageFinder and index related code skip news Does not need a NEWS file entry (eg: trivial changes) type: refactor Refactoring code
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants