Skip to content

feat: add pypi attestation discovery #1067

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 10 commits into
base: main
Choose a base branch
from

Conversation

benmss
Copy link
Member

@benmss benmss commented Apr 24, 2025

Summary

This PR adds discovery of PyPI attestation. URLs to these attestation files are sought via the deps.dev API.

Description of changes

  • DepsDevRepoFinder was updated to use the DepsDevService, ensuring consistent and easily configurable use of the API
  • Tests were added for DepsDevRepoFinder functions (they were not added previously), including for the functions that PyPI attestation discovery relies upon.
  • PyPI attestations do not have a predicate. The pypi-attestation is used to extract information from the attestation certificate. This information is coerced into a predicate for use elsewhere within Macaron.
  • Addition of an integration test case using the ultralytics Python library as its target.

Related issues

Closes #947

@benmss benmss self-assigned this Apr 24, 2025
@oracle-contributor-agreement oracle-contributor-agreement bot added the OCA Verified All contributors have signed the Oracle Contributor Agreement. label Apr 24, 2025
@benmss benmss force-pushed the 947-discover-pypi-attestation branch from b97a9cb to a362d7c Compare April 24, 2025 04:28
@benmss benmss force-pushed the 947-discover-pypi-attestation branch from 2df212b to 6d7cf95 Compare April 24, 2025 06:31
@benmss benmss marked this pull request as ready for review April 24, 2025 13:00
@benmss benmss requested review from behnazh-w and tromai as code owners April 24, 2025 13:00
logger.debug("No predicate in payload statement.")
return None, None

repo = json_extract(predicate, ["sourceUri"], str)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From the documentation for the predicate of https://docs.pypi.org/attestations/publish/v1/, currently its content is not defined. I wonder if the sourceUri and sourceDigest was copied from an existing implementation or it's purposely put here to catch these data if they are available in the future?
I have decoded the statement of https://pypi.org/integrity/ultralytics/8.3.119/ultralytics-8.3.119.tar.gz/provenance and got

{
  "_type": "https://in-toto.io/Statement/v1",
  "subject": [
    {
      "name": "ultralytics-8.3.119.tar.gz",
      "digest": {
        "sha256": "497bdcf3eb1beb082f451d42e5af2a6af944693a5991c78a9b9b0ce538593153"
      }
    }
  ],
  "predicateType": "https://docs.pypi.org/attestations/publish/v1",
  "predicate": null
}

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I may have confused things by converting the certificate information into a predicate and attaching it to the pypi spec that has none. The type and the related function could be renamed to better distinguish them from the actual spec. The actual field names within the substitute predicate were chosen for simplicity, and loosely based on existing predicates I suppose.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh okay, I can see the big picture now. I'm okay with some extra documentation here to indicate that we are relying on a Predicate that we built ourselves, and that it's not reflecting the real predicate format from Pypi (as there is None).

benmss added 2 commits May 2, 2025 14:08
Signed-off-by: Ben Selwyn-Smith <[email protected]>
Signed-off-by: Ben Selwyn-Smith <[email protected]>
If the service is misconfigured, the API is invalid, a network error happens,
or unexpected response is returned by the API.
"""
if "%" in purl:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be good to have a comment or an example here to explain why we need to convert all "%xx" to their character equivalent.
If I understand it correctly, if we don't convert "%xx", the call to quote / encode below will convert % to "%25", which is not what we want.

"""
if "%" in purl:
purl = decode(purl)
purl = encode(purl, safe="")
Copy link
Member

@tromai tromai May 2, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if we can use decode even if purl doesn't have any %xx?
If it's possible to do so it can be simplified to

convert_escape_char_purl = decode(purl)
encoded_purl = encode(convert_escape_char_purl, safe="")

Alternatively, I'm happy to keep the if statement too, but perhaps it's better if we use a different variable to stored the mutated version of the original purl string. For example

        convert_escape_char_purl = decode(purl) if "%" in purl else purl
        encoded_purl = encode(convert_escape_char_purl, safe="")

APIAccessError
If the service is misconfigured, the API is invalid, a network error happens,
or unexpected response is returned by the API.
Any
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This doc string part should be updated with the new return type hint too.

@@ -20,28 +21,24 @@ class DepsDevService:
"""The deps.dev service class."""

@staticmethod
def get_package_info(purl: str) -> dict | None:
"""Check if the package identified by the PackageURL (PURL) exists and return its information.
def get_endpoint(purl: bool = True, path: str | None = None) -> urllib.parse.SplitResult:
Copy link
Member

@tromai tromai May 2, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have a couple of comments about this function

  • Because we have two types of end point for different purposes, the PURL endpoint https://api.deps.dev/v3alpha/purl/ and the BASE endpoint https://api.deps.dev/v3alpha/. I wonder if we could separate them into 2 functions?
def get_base_endpoint(...)
def get_purl_endpoint(...)
  • Could we add a small comment (may be as a doc string) to explicitly say that path is added directly to the end of the URL endpoint without any modification? - This is just to prevent using this method without percent encoding path before hand. Additional examples would be good to have too.

return ""

return response.text
api_endpoint = DepsDevService.get_endpoint(
Copy link
Member

@tromai tromai May 2, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we add a small example (as a comment) for what the api end point should look like here ?

https://api.deps.dev/v3alpha/systems/pypi/packages/ultralytics/versions/8.3.70


If a version is not specified, remote API calls will be used to try and find one.
@staticmethod
def get_attestation(purl: PackageURL) -> tuple[dict | None, bool]:
Copy link
Member

@tromai tromai May 2, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this function intended to support extracting the attestation for all type of PURL or only PYPI one ?
I think this function assumes that the link to the attestation provided by deps.dev (e.g. https://pypi.org/integrity/ultralytics/8.3.70/ultralytics-8.3.70-py3-none-any.whl/provenance) to a PYPI provenance.
If that is True, we might want to rename the function and modify the doc string to reflect this assumption.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is currently only intended for PyPI. I have moved the PyPI specific extraction code to the related registry, and will report unsupported for other types.

"""
Parse the deps.dev json file and extract the repository links.
Extract the repository links from the deps.dev json data.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have a naive question. Does the BASE endpoint and the PURL endpoint returns the JSON data with similar schema and content ?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, I see. Thanks

@@ -83,6 +103,13 @@ def _load_provenance_file_content(
# Some provenances, such as Witness may not include the DSSE envelope `dsseEnvelope`
# property but contain its value directly.
provenance_payload = provenance.get("payload", None)
if not provenance_payload:
# GitHub Attestation.
# TODO Check if old method (above) actually works.
Copy link
Member

@tromai tromai May 6, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could a doc string be added here - to explain the reason why there is a change in handling Github attestation?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps @behnazh-w could weigh in on this?

if json_payload["predicate"]:
return json_payload

# For provenance without a predicate (e.g. PyPI), try to use the provenance certificate instead.
Copy link
Member

@tromai tromai May 6, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it would be good to add an if statement to check for predicateType being "https://docs.pypi.org/attestations/publish/v1".
This is for recognizing if predicate being empty is expected (because it's a PYPI attestation) or not (e.g the content is malformed).


Parameters
----------
decoded_certificate: bytes
Copy link
Member

@tromai tromai May 6, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We assume that the format of this certificate is Distinguished Encoding Rules (DER) following https://peps.python.org/pep-0740/#provenance-objects. I think we should mention this assumption in the function name and the doc string for extra information.
I also think we can rename decoded_certificate to x509_der_certificate, because the "decoded" part seems to not be in the scope of this function (it has been handled in the function _load_provenance_file_content)


Returns
-------
dict
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Even though the return type of this function is a dictionary, it only contains a fix number of keys:

  • source_repo
  • source_digest
  • workflow
  • invocation
    Do you think it would be worth capturing this in a TypedDict. Or instead we could return a instance of PyPICertificatePredicate from this function if the dictionary is not used elsewhere.

continue

claim_name = _OID_IDS[extension.oid.dotted_string]
certificate_claims[claim_name] = extension.value.value
Copy link
Member

@tromai tromai May 6, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are we assuming that extension here is always of type cryptography.x509.UnrecognizedExtension because of the OIDs we are looking for are not part of the standard X.509 extension type.
If true, I think it's good to have a small comment with that link to api documentation of cryptography.x509.UnrecognizedExtension.
This is due to cryptography.x509.UnrecognizedExtension has the value attribute. Other type of Extension might not have them.

"""
certificate = x509.load_der_x509_certificate(decoded_certificate)
certificate_claims = {}
for extension in certificate.extensions:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

claim_name = _OID_IDS[extension.oid.dotted_string]
certificate_claims[claim_name] = extension.value.value

for name in _OID_NAMES:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if _OID_NAMES is ever used as a dictionary? If not, we can turn it into a list or reuse _OID_IDS.values.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It was redundant and has been removed.

raise ValueError(f"Missing certificate value: {name}")

# Values are DER encoded UTF-8 strings. Removing the first two bytes seems to be sufficient.
value: str = certificate_claims[name][2:].decode("UTF-8")
Copy link
Member

@tromai tromai May 6, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we perform this operation in the first loop ?

for extension in certificate.extensions:
    ...

I think the purpose of this loop is for validating certificate_claims content, so in my opinion, it's best if we don't do extra mutation to certificate_claims.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have combined the loops.

"attestation_bundles": [{"attestations": [{"foo": "bar"}]}]
}
"""
data = data.replace("*replace_url*", attestation_url)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if we can use a formatted multi-line string for data?

purl = PackageURL.from_string("pkg:pypi/test@3")
target_url = (
f"/{deps_dev_service_mock['api']}/"
+ f"{'/'.join(['systems', purl.type, 'packages', purl.name, 'versions', purl.version or ''])}"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder why we have purl.version or '' here. I think purl is not modified at all in this test.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This has been removed as part of using PURL for attestation instead of the BASE endpoint.

Comment on lines 68 to 79
def test_get_project_info_failures(
httpserver: HTTPServer, deps_dev_service_mock: dict, repo_url: str, server_url: str, data: str
) -> None:
"""Test get project info failures."""
if not repo_url:
repo_url = f"{deps_dev_service_mock['base_scheme']}://{deps_dev_service_mock['base_netloc']}{server_url}"

if server_url:
target_url = f"/{deps_dev_service_mock['api']}/projects/{deps_dev_service_mock['base_hostname']}{server_url}"
httpserver.expect_request(target_url).respond_with_data(data)

assert not DepsDevRepoFinder().get_project_info(repo_url)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think there are 2 cases of failures we are testing in this function

  • The Repository URL is given, and not valid
  • The repository URL is available (from pytest-httpserver) but the result JSON is invalid

Could we have 2 separated test functions for each case, so that we can simplify the if conditions on the test inputs.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have separated the two test cases.

Copy link
Member

@tromai tromai left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have finished my round of review. Thank you.

benmss added 3 commits May 7, 2025 14:10
Signed-off-by: Ben Selwyn-Smith <[email protected]>
Signed-off-by: Ben Selwyn-Smith <[email protected]>
Signed-off-by: Ben Selwyn-Smith <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
OCA Verified All contributors have signed the Oracle Contributor Agreement.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Obtain PyPI Publish Attestation
2 participants