
Commit 8453fa5

Merge pull request #7073 from cjerdonek/remove-get-pages
Change CollectedLinks to store project page URLs
2 parents 105e7bd + 9bd0db2

File tree

6 files changed: +109 -69 lines

docs/html/development/architecture/package-finding.rst

Lines changed: 22 additions & 17 deletions
@@ -15,17 +15,16 @@ Overview
 Here is a rough description of the process that pip uses to choose what
 file to download for a package, given a requirement:
 
-1. Access the various network and file system locations configured for pip
-   that contain package files. These locations can include, for example,
-   pip's :ref:`--index-url <--index-url>` (with default
-   https://pypi.org/simple/ ) and any configured
-   :ref:`--extra-index-url <--extra-index-url>` locations.
-   Each of these locations is a `PEP 503`_ "simple repository" page, which
-   is an HTML page of anchor links.
-2. Collect together all of the links (e.g. by parsing the anchor links
-   from the HTML pages) and create ``Link`` objects from each of these.
-   The :ref:`LinkCollector <link-collector-class>` class is responsible
-   for both this step and the previous.
+1. Collect together the various network and file system locations containing
+   project package files. These locations are derived, for example, from pip's
+   :ref:`--index-url <--index-url>` (with default https://pypi.org/simple/ )
+   setting and any configured :ref:`--extra-index-url <--extra-index-url>`
+   locations. Each of the project page URL's is an HTML page of anchor links,
+   as defined in `PEP 503`_, the "Simple Repository API."
+2. For each project page URL, fetch the HTML and parse out the anchor links,
+   creating a ``Link`` object from each one. The :ref:`LinkCollector
+   <link-collector-class>` class is responsible for both the previous step
+   and fetching the HTML over the network.
 3. Determine which of the links are minimally relevant, using the
    :ref:`LinkEvaluator <link-evaluator-class>` class. Create an
    ``InstallationCandidate`` object (aka candidate for install) for each

@@ -111,6 +110,12 @@ One of ``PackageFinder``'s main top-level methods is
 class's ``compute_best_candidate()`` method on the return value of
 ``find_all_candidates()``. This corresponds to steps 4-5 of the Overview.
 
+``PackageFinder`` also has a ``process_project_url()`` method (called by
+``find_best_candidate()``) to process a `PEP 503`_ "simple repository"
+project page. This method fetches and parses the HTML from a PEP 503 project
+page URL, extracts the anchor elements and creates ``Link`` objects from
+them, and then evaluates those links.
+
 
 .. _link-collector-class:

@@ -119,12 +124,8 @@ The ``LinkCollector`` class
 
 The :ref:`LinkCollector <link-collector-class>` class is the class
 responsible for collecting the raw list of "links" to package files
-(represented as ``Link`` objects). An instance of the class accesses the
-various `PEP 503`_ HTML "simple repository" pages, parses their HTML,
-extracts the links from the anchor elements, and creates ``Link`` objects
-from that information. The ``LinkCollector`` class is "unintelligent" in that
-it doesn't do any evaluation of whether the links are relevant to the
-original requirement; it just collects them.
+(represented as ``Link`` objects) from file system locations, as well as the
+`PEP 503`_ project page URL's that ``PackageFinder`` should access.
 
 The ``LinkCollector`` class takes into account the user's :ref:`--find-links
 <--find-links>`, :ref:`--extra-index-url <--extra-index-url>`, and related

@@ -133,6 +134,10 @@ method is the ``collect_links()`` method. The :ref:`PackageFinder
 <package-finder-class>` class invokes this method as the first step of its
 ``find_all_candidates()`` method.
 
+``LinkCollector`` also has a ``fetch_page()`` method to fetch the HTML from a
+project page URL. This method is "unintelligent" in that it doesn't parse the
+HTML.
+
 The ``LinkCollector`` class is the only class in the ``index`` sub-package that
 makes network requests and is the only class in the sub-package that depends
 directly on ``PipSession``, which stores pip's configuration options and
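To make the revised division of labor concrete, here is a minimal sketch of the new flow (not pip's exact API surface; it assumes a ``link_collector``, ``finder``, and ``link_evaluator`` have already been constructed, e.g. via the test helpers used later in this commit):

    # Step 1: LinkCollector gathers locations but no longer fetches pages.
    collected = link_collector.collect_links('twine')

    # collected.files and collected.find_links already hold Link objects;
    # collected.project_urls holds the PEP 503 project page URLs still
    # to be fetched.

    # Step 2: PackageFinder now drives fetching, parsing, and evaluation,
    # one project page at a time.
    for project_url in collected.project_urls:
        candidates = finder.process_project_url(
            project_url, link_evaluator=link_evaluator,
        )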

src/pip/_internal/index/collector.py

Lines changed: 22 additions & 25 deletions
@@ -27,8 +27,8 @@
 
 if MYPY_CHECK_RUNNING:
     from typing import (
-        Callable, Dict, Iterable, List, MutableMapping, Optional, Sequence,
-        Tuple, Union,
+        Callable, Iterable, List, MutableMapping, Optional, Sequence, Tuple,
+        Union,
     )
     import xml.etree.ElementTree

@@ -435,29 +435,36 @@ def sort_path(path):
 class CollectedLinks(object):
 
     """
-    Encapsulates all the Link objects collected by a call to
-    LinkCollector.collect_links(), stored separately as--
+    Encapsulates the return value of a call to LinkCollector.collect_links().
+
+    The return value includes both URLs to project pages containing package
+    links, as well as individual package Link objects collected from other
+    sources.
+
+    This info is stored separately as:
 
     (1) links from the configured file locations,
     (2) links from the configured find_links, and
-    (3) a dict mapping HTML page url to links from that page.
+    (3) urls to HTML project pages, as described by the PEP 503 simple
+        repository API.
     """
 
     def __init__(
         self,
-        files,       # type: List[Link]
-        find_links,  # type: List[Link]
-        pages,       # type: Dict[str, List[Link]]
+        files,         # type: List[Link]
+        find_links,    # type: List[Link]
+        project_urls,  # type: List[Link]
     ):
         # type: (...) -> None
         """
         :param files: Links from file locations.
         :param find_links: Links from find_links.
-        :param pages: A dict mapping HTML page url to links from that page.
+        :param project_urls: URLs to HTML project pages, as described by
+            the PEP 503 simple repository API.
         """
         self.files = files
         self.find_links = find_links
-        self.pages = pages
+        self.project_urls = project_urls
 
 
 class LinkCollector(object):

@@ -483,18 +490,12 @@ def find_links(self):
         # type: () -> List[str]
         return self.search_scope.find_links
 
-    def _get_pages(self, locations):
-        # type: (Iterable[Link]) -> Iterable[HTMLPage]
+    def fetch_page(self, location):
+        # type: (Link) -> Optional[HTMLPage]
         """
-        Yields (page, page_url) from the given locations, skipping
-        locations that have errors.
+        Fetch an HTML page containing package links.
         """
-        for location in locations:
-            page = _get_html_page(location, session=self.session)
-            if page is None:
-                continue
-
-            yield page
+        return _get_html_page(location, session=self.session)
 
     def collect_links(self, project_name):
         # type: (str) -> CollectedLinks

@@ -537,12 +538,8 @@ def collect_links(self, project_name):
             lines.append('* {}'.format(link))
         logger.debug('\n'.join(lines))
 
-        pages_links = {}
-        for page in self._get_pages(url_locations):
-            pages_links[page.url] = list(parse_links(page))
-
         return CollectedLinks(
             files=file_links,
             find_links=find_link_links,
-            pages=pages_links,
+            project_urls=url_locations,
         )
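For illustration, here is roughly what a CollectedLinks instance looks like under the new shape (the URLs are hypothetical; construction normally happens inside ``collect_links()``):

    links = CollectedLinks(
        files=[Link('file:///tmp/packages/simple-1.0.tar.gz')],   # hypothetical
        find_links=[Link('file:///tmp/packages/')],               # hypothetical
        project_urls=[Link('https://pypi.org/simple/simple/')],   # hypothetical
    )
    # Previously the third field was a dict mapping page URL to the links
    # parsed from that page; now only the page URLs are stored, and the
    # fetching/parsing is deferred to LinkCollector.fetch_page() and
    # PackageFinder.process_project_url().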

src/pip/_internal/index/package_finder.py

Lines changed: 25 additions & 8 deletions
@@ -19,6 +19,7 @@
     InvalidWheelFilename,
     UnsupportedWheel,
 )
+from pip._internal.index.collector import parse_links
 from pip._internal.models.candidate import InstallationCandidate
 from pip._internal.models.format_control import FormatControl
 from pip._internal.models.link import Link

@@ -778,6 +779,25 @@ def evaluate_links(self, link_evaluator, links):
 
         return candidates
 
+    def process_project_url(self, project_url, link_evaluator):
+        # type: (Link, LinkEvaluator) -> List[InstallationCandidate]
+        logger.debug(
+            'Fetching project page and analyzing links: %s', project_url,
+        )
+        html_page = self._link_collector.fetch_page(project_url)
+        if html_page is None:
+            return []
+
+        page_links = list(parse_links(html_page))
+
+        with indent_log():
+            package_links = self.evaluate_links(
+                link_evaluator,
+                links=page_links,
+            )
+
+        return package_links
+
     def find_all_candidates(self, project_name):
         # type: (str) -> List[InstallationCandidate]
         """Find all available InstallationCandidate for project_name

@@ -798,14 +818,11 @@ def find_all_candidates(self, project_name):
         )
 
         page_versions = []
-        for page_url, page_links in collected_links.pages.items():
-            logger.debug('Analyzing links from page %s', page_url)
-            with indent_log():
-                new_versions = self.evaluate_links(
-                    link_evaluator,
-                    links=page_links,
-                )
-            page_versions.extend(new_versions)
+        for project_url in collected_links.project_urls:
+            package_links = self.process_project_url(
+                project_url, link_evaluator=link_evaluator,
+            )
+            page_versions.extend(package_links)
 
         file_versions = self.evaluate_links(
             link_evaluator,
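The behavioral contract worth noting: ``fetch_page()`` returns None when a page cannot be retrieved, and ``process_project_url()`` then contributes no candidates rather than raising. A sketch of driving it directly, using the test helpers from the unit tests below (the ``index_url`` value is an assumption):

    # index_url is assumed to point at a PEP 503 simple index fixture.
    finder = make_test_finder(index_urls=[index_url])
    link_evaluator = finder.make_link_evaluator('simple')
    candidates = finder.process_project_url(
        Link('{}/simple'.format(index_url)), link_evaluator=link_evaluator,
    )
    # candidates is [] if the page could not be fetched; otherwise a list
    # of InstallationCandidate objects for links that passed evaluation.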

tests/functional/test_install_config.py

Lines changed: 3 additions & 3 deletions
@@ -91,7 +91,7 @@ def test_command_line_append_flags(script, virtualenv, data):
         'test.pypi.org',
     )
     assert (
-        "Analyzing links from page https://test.pypi.org"
+        "Fetching project page and analyzing links: https://test.pypi.org"
         in result.stdout
     ), str(result)
     virtualenv.clear()

@@ -100,7 +100,7 @@ def test_command_line_append_flags(script, virtualenv, data):
         '--trusted-host', 'test.pypi.org',
     )
     assert (
-        "Analyzing links from page https://test.pypi.org"
+        "Fetching project page and analyzing links: https://test.pypi.org"
         in result.stdout
     )
     assert (

@@ -124,7 +124,7 @@ def test_command_line_appends_correctly(script, data):
     )
 
     assert (
-        "Analyzing links from page https://test.pypi.org"
+        "Fetching project page and analyzing links: https://test.pypi.org"
         in result.stdout
     ), result.stdout
     assert (

tests/unit/test_collector.py

Lines changed: 21 additions & 16 deletions
@@ -436,13 +436,28 @@ def check_links_include(links, names):
 class TestLinkCollector(object):
 
     @patch('pip._internal.index.collector._get_html_response')
-    def test_collect_links(self, mock_get_html_response, caplog, data):
-        caplog.set_level(logging.DEBUG)
+    def test_fetch_page(self, mock_get_html_response):
+        url = 'https://pypi.org/simple/twine/'
+
+        fake_response = make_fake_html_response(url)
+        mock_get_html_response.return_value = fake_response
 
-        expected_url = 'https://pypi.org/simple/twine/'
+        location = Link(url)
+        link_collector = make_test_link_collector()
+        actual = link_collector.fetch_page(location)
 
-        fake_page = make_fake_html_response(expected_url)
-        mock_get_html_response.return_value = fake_page
+        assert actual.content == fake_response.content
+        assert actual.encoding is None
+        assert actual.url == url
+
+        # Also check that the right session object was passed to
+        # _get_html_response().
+        mock_get_html_response.assert_called_once_with(
+            url, session=link_collector.session,
+        )
+
+    def test_collect_links(self, caplog, data):
+        caplog.set_level(logging.DEBUG)
 
         link_collector = make_test_link_collector(
             find_links=[data.find_links],

@@ -452,24 +467,14 @@ def test_collect_links(self, mock_get_html_response, caplog, data):
         )
         actual = link_collector.collect_links('twine')
 
-        mock_get_html_response.assert_called_once_with(
-            expected_url, session=link_collector.session,
-        )
-
         # Spot-check the CollectedLinks return value.
         assert len(actual.files) > 20
         check_links_include(actual.files, names=['simple-1.0.tar.gz'])
 
         assert len(actual.find_links) == 1
         check_links_include(actual.find_links, names=['packages'])
 
-        actual_pages = actual.pages
-        assert list(actual_pages) == [expected_url]
-        actual_page_links = actual_pages[expected_url]
-        assert len(actual_page_links) == 1
-        assert actual_page_links[0].url == (
-            'https://pypi.org/abc-1.0.tar.gz#md5=000000000'
-        )
+        assert actual.project_urls == [Link('https://pypi.org/simple/twine/')]
 
         expected_message = dedent("""\
         1 location(s) to search for versions of twine:

tests/unit/test_finder.py

Lines changed: 16 additions & 0 deletions
@@ -473,6 +473,22 @@ def test_evaluate_link__substring_fails(self, url, expected_msg):
         assert actual == (False, expected_msg)
 
 
+def test_process_project_url(data):
+    project_name = 'simple'
+    index_url = data.index_url('simple')
+    project_url = Link('{}/{}'.format(index_url, project_name))
+    finder = make_test_finder(index_urls=[index_url])
+    link_evaluator = finder.make_link_evaluator(project_name)
+    actual = finder.process_project_url(
+        project_url, link_evaluator=link_evaluator,
+    )
+
+    assert len(actual) == 1
+    package_link = actual[0]
+    assert package_link.project == 'simple'
+    assert str(package_link.version) == '1.0'
+
+
 def test_find_all_candidates_nothing():
     """Find nothing without anything"""
     finder = make_test_finder()
