Change CollectedLinks to store project page URLs #7073
Merged
This is a refactoring PR that changes `CollectedLinks` to store the project page URLs instead of the mapping of "project page URL" to "list of links". In particular, `LinkCollector.collect_links()` will no longer do any fetching of pages. Instead, the network / page-fetching code is moved to `LinkCollector.fetch_page()` (to be called by `PackageFinder`).

These changes are useful for two main reasons:
1. It decouples the HTML-parsing code from `LinkCollector`, so `LinkCollector` no longer has to know about HTML parsing. Instead, it encapsulates just the network aspects, combined with knowing about and processing the relevant CLI options like `--find-links`. More generally, this will help isolate the network code from the package-finding logic.
2. It will make it easier for us to filter out (aka evaluate) links while we're parsing the HTML, instead of storing the full list of links and then doing another pass over them later to pick out the meaningful ones. This is useful, e.g., if the full list of links is gigantic but only a small fraction of them correspond to project links.
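To illustrate the second point, here is a minimal sketch of evaluating links during parsing rather than in a later pass. It is not pip's code: `FilteringAnchorParser`, `parse_and_filter`, and the `keep` predicate are hypothetical names, and the standard library's `html.parser` stands in for whatever parser is actually used.

```python
from html.parser import HTMLParser
from typing import Callable, List


class FilteringAnchorParser(HTMLParser):
    """Hypothetical parser: keeps an <a href> only if it passes `keep`."""

    def __init__(self, keep: Callable[[str], bool]) -> None:
        super().__init__()
        self._keep = keep
        self.kept: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        # Evaluate the link as soon as it is seen, so rejected links
        # are never stored.
        if href is not None and self._keep(href):
            self.kept.append(href)


def parse_and_filter(html: str, keep: Callable[[str], bool]) -> List[str]:
    """Parse `html` and return only the hrefs accepted by `keep`."""
    parser = FilteringAnchorParser(keep)
    parser.feed(html)
    parser.close()
    return parser.kept
```

With this shape, memory use is bounded by the number of accepted links rather than by the size of the page.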
This PR also simplifies `PackageFinder.find_all_candidates()` by refactoring out a `PackageFinder.process_project_url()` method, which is responsible for calling `LinkCollector.fetch_page()` and then doing the HTML-parsing and link evaluation.
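The resulting division of responsibilities might be sketched roughly as follows. The class and method names are taken from the description above, but everything else is an assumption made for illustration: the constructor arguments, the `project_urls` field name, the injected `fetch`, `parse_links`, and `evaluate` callables, and the return types do not reflect pip's actual signatures.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Optional


@dataclass
class CollectedLinks:
    # After this change: project page URLs only, not a URL -> links mapping.
    # (Field name is an assumption.)
    project_urls: List[str]


class LinkCollector:
    """Simplified stand-in: decides which URLs to visit and how to fetch them."""

    def __init__(self, index_urls: List[str],
                 fetch: Callable[[str], Optional[str]]) -> None:
        self._index_urls = index_urls
        self._fetch = fetch  # hypothetical injected network function

    def collect_links(self, project_name: str) -> CollectedLinks:
        # No page fetching happens here; only URLs are computed.
        urls = [f"{u.rstrip('/')}/{project_name}/" for u in self._index_urls]
        return CollectedLinks(project_urls=urls)

    def fetch_page(self, url: str) -> Optional[str]:
        # All network access is confined to this method.
        return self._fetch(url)


class PackageFinder:
    """Simplified stand-in: owns HTML parsing and link evaluation."""

    def __init__(self, collector: LinkCollector,
                 parse_links: Callable[[str], Iterable[str]],
                 evaluate: Callable[[str], bool]) -> None:
        self._collector = collector
        self._parse_links = parse_links
        self._evaluate = evaluate

    def process_project_url(self, url: str) -> List[str]:
        page = self._collector.fetch_page(url)
        if page is None:
            return []
        # Parse and evaluate in one pass over the page's links.
        return [link for link in self._parse_links(page) if self._evaluate(link)]

    def find_all_candidates(self, project_name: str) -> List[str]:
        collected = self._collector.collect_links(project_name)
        candidates: List[str] = []
        for url in collected.project_urls:
            candidates.extend(self.process_project_url(url))
        return candidates
```

In this sketch `LinkCollector` never sees HTML, and `PackageFinder` never touches the network directly, so each side can be tested with a fake of the other.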