-
Notifications
You must be signed in to change notification settings - Fork 28
feat: add pypi attestation discovery #1067
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
Signed-off-by: Ben Selwyn-Smith <[email protected]>
b97a9cb
to
a362d7c
Compare
Signed-off-by: Ben Selwyn-Smith <[email protected]>
2df212b
to
6d7cf95
Compare
Signed-off-by: Ben Selwyn-Smith <[email protected]>
logger.debug("No predicate in payload statement.") | ||
return None, None | ||
|
||
repo = json_extract(predicate, ["sourceUri"], str) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
From the documentation for the predicate of https://docs.pypi.org/attestations/publish/v1/
, currently its content is not defined. I wonder if the sourceUri
and sourceDigest
was copied from an existing implementation or it's purposely put here to catch these data if they are available in the future?
I have decoded the statement of https://pypi.org/integrity/ultralytics/8.3.119/ultralytics-8.3.119.tar.gz/provenance and got
{
"_type": "https://in-toto.io/Statement/v1",
"subject": [
{
"name": "ultralytics-8.3.119.tar.gz",
"digest": {
"sha256": "497bdcf3eb1beb082f451d42e5af2a6af944693a5991c78a9b9b0ce538593153"
}
}
],
"predicateType": "https://docs.pypi.org/attestations/publish/v1",
"predicate": null
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I may have confused things by converting the certificate information into a predicate and attaching it to the pypi spec that has none. The type and the related function could be renamed to better distinguish them from the actual spec. The actual field names within the substitute predicate were chosen for simplicity, and loosely based on existing predicates I suppose.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh okay, I can see the big picture now. I'm okay with some extra documentation here to indicate that we are relying on a Predicate that we built ourselves, and that it's not reflecting the real predicate format from Pypi (as there is None).
Signed-off-by: Ben Selwyn-Smith <[email protected]>
Signed-off-by: Ben Selwyn-Smith <[email protected]>
Signed-off-by: Ben Selwyn-Smith <[email protected]>
Signed-off-by: Ben Selwyn-Smith <[email protected]>
If the service is misconfigured, the API is invalid, a network error happens, | ||
or unexpected response is returned by the API. | ||
""" | ||
if "%" in purl: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It would be good to have a comment or an example here to explain why we need to convert all "%xx" to their character equivalent.
If I understand it correctly, if we don't convert "%xx", the call to quote / encode below will convert % to "%25", which is not what we want.
""" | ||
if "%" in purl: | ||
purl = decode(purl) | ||
purl = encode(purl, safe="") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder if we can use decode
even if purl
doesn't have any %xx
?
If it's possible to do so it can be simplified to
convert_escape_char_purl = decode(purl)
encoded_purl = encode(convert_escape_char_purl, safe="")
Alternatively, I'm happy to keep the if statement too, but perhaps it's better if we use a different variable to stored the mutated version of the original purl string. For example
convert_escape_char_purl = decode(purl) if "%" in purl else purl
encoded_purl = encode(convert_escape_char_purl, safe="")
APIAccessError | ||
If the service is misconfigured, the API is invalid, a network error happens, | ||
or unexpected response is returned by the API. | ||
Any |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This doc string part should be updated with the new return type hint too.
@@ -20,28 +21,24 @@ class DepsDevService: | |||
"""The deps.dev service class.""" | |||
|
|||
@staticmethod | |||
def get_package_info(purl: str) -> dict | None: | |||
"""Check if the package identified by the PackageURL (PURL) exists and return its information. | |||
def get_endpoint(purl: bool = True, path: str | None = None) -> urllib.parse.SplitResult: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have a couple of comments about this function
- Because we have two types of end point for different purposes, the PURL endpoint
https://api.deps.dev/v3alpha/purl/
and the BASE endpointhttps://api.deps.dev/v3alpha/
. I wonder if we could separate them into 2 functions?
def get_base_endpoint(...)
def get_purl_endpoint(...)
- Could we add a small comment (may be as a doc string) to explicitly say that
path
is added directly to the end of the URL endpoint without any modification? - This is just to prevent using this method without percent encodingpath
before hand. Additional examples would be good to have too.
return "" | ||
|
||
return response.text | ||
api_endpoint = DepsDevService.get_endpoint( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could we add a small example (as a comment) for what the api end point should look like here ?
https://api.deps.dev/v3alpha/systems/pypi/packages/ultralytics/versions/8.3.70
|
||
If a version is not specified, remote API calls will be used to try and find one. | ||
@staticmethod | ||
def get_attestation(purl: PackageURL) -> tuple[dict | None, bool]: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this function intended to support extracting the attestation for all type of PURL or only PYPI one ?
I think this function assumes that the link to the attestation provided by deps.dev (e.g. https://pypi.org/integrity/ultralytics/8.3.70/ultralytics-8.3.70-py3-none-any.whl/provenance) to a PYPI provenance.
If that is True, we might want to rename the function and modify the doc string to reflect this assumption.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is currently only intended for PyPI. I have moved the PyPI specific extraction code to the related registry, and will report unsupported for other types.
""" | ||
Parse the deps.dev json file and extract the repository links. | ||
Extract the repository links from the deps.dev json data. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have a naive question. Does the BASE endpoint and the PURL endpoint returns the JSON data with similar schema and content ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
They return slightly different results in terms of the schema.
https://api.deps.dev/v3alpha/purl/pkg%3Anpm%2F%40colors%[email protected]
https://api.deps.dev/v3alpha/systems/npm/packages/%40colors%2Fcolors/versions/1.4.0
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, I see. Thanks
@@ -83,6 +103,13 @@ def _load_provenance_file_content( | |||
# Some provenances, such as Witness may not include the DSSE envelope `dsseEnvelope` | |||
# property but contain its value directly. | |||
provenance_payload = provenance.get("payload", None) | |||
if not provenance_payload: | |||
# GitHub Attestation. | |||
# TODO Check if old method (above) actually works. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could a doc string be added here - to explain the reason why there is a change in handling Github attestation?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Perhaps @behnazh-w could weigh in on this?
if json_payload["predicate"]: | ||
return json_payload | ||
|
||
# For provenance without a predicate (e.g. PyPI), try to use the provenance certificate instead. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it would be good to add an if statement to check for predicateType being "https://docs.pypi.org/attestations/publish/v1".
This is for recognizing if predicate
being empty is expected (because it's a PYPI attestation) or not (e.g the content is malformed).
|
||
Parameters | ||
---------- | ||
decoded_certificate: bytes |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We assume that the format of this certificate is Distinguished Encoding Rules (DER) following https://peps.python.org/pep-0740/#provenance-objects. I think we should mention this assumption in the function name and the doc string for extra information.
I also think we can rename decoded_certificate
to x509_der_certificate
, because the "decoded" part seems to not be in the scope of this function (it has been handled in the function _load_provenance_file_content
)
|
||
Returns | ||
------- | ||
dict |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Even though the return type of this function is a dictionary, it only contains a fix number of keys:
- source_repo
- source_digest
- workflow
- invocation
Do you think it would be worth capturing this in aTypedDict
. Or instead we could return a instance ofPyPICertificatePredicate
from this function if the dictionary is not used elsewhere.
continue | ||
|
||
claim_name = _OID_IDS[extension.oid.dotted_string] | ||
certificate_claims[claim_name] = extension.value.value |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are we assuming that extension
here is always of type cryptography.x509.UnrecognizedExtension because of the OIDs we are looking for are not part of the standard X.509 extension type.
If true, I think it's good to have a small comment with that link to api documentation of cryptography.x509.UnrecognizedExtension
.
This is due to cryptography.x509.UnrecognizedExtension
has the value
attribute. Other type of Extension might not have them.
""" | ||
certificate = x509.load_der_x509_certificate(decoded_certificate) | ||
certificate_claims = {} | ||
for extension in certificate.extensions: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think assessing certificate.extensions
can raise Error - https://cryptography.io/en/latest/x509/reference/#cryptography.x509.Certificate.extensions
claim_name = _OID_IDS[extension.oid.dotted_string] | ||
certificate_claims[claim_name] = extension.value.value | ||
|
||
for name in _OID_NAMES: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder if _OID_NAMES
is ever used as a dictionary? If not, we can turn it into a list or reuse _OID_IDS.values
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It was redundant and has been removed.
raise ValueError(f"Missing certificate value: {name}") | ||
|
||
# Values are DER encoded UTF-8 strings. Removing the first two bytes seems to be sufficient. | ||
value: str = certificate_claims[name][2:].decode("UTF-8") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could we perform this operation in the first loop ?
for extension in certificate.extensions:
...
I think the purpose of this loop is for validating certificate_claims
content, so in my opinion, it's best if we don't do extra mutation to certificate_claims
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have combined the loops.
"attestation_bundles": [{"attestations": [{"foo": "bar"}]}] | ||
} | ||
""" | ||
data = data.replace("*replace_url*", attestation_url) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder if we can use a formatted multi-line string for data
?
purl = PackageURL.from_string("pkg:pypi/test@3") | ||
target_url = ( | ||
f"/{deps_dev_service_mock['api']}/" | ||
+ f"{'/'.join(['systems', purl.type, 'packages', purl.name, 'versions', purl.version or ''])}" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder why we have purl.version or ''
here. I think purl
is not modified at all in this test.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This has been removed as part of using PURL for attestation instead of the BASE endpoint.
def test_get_project_info_failures( | ||
httpserver: HTTPServer, deps_dev_service_mock: dict, repo_url: str, server_url: str, data: str | ||
) -> None: | ||
"""Test get project info failures.""" | ||
if not repo_url: | ||
repo_url = f"{deps_dev_service_mock['base_scheme']}://{deps_dev_service_mock['base_netloc']}{server_url}" | ||
|
||
if server_url: | ||
target_url = f"/{deps_dev_service_mock['api']}/projects/{deps_dev_service_mock['base_hostname']}{server_url}" | ||
httpserver.expect_request(target_url).respond_with_data(data) | ||
|
||
assert not DepsDevRepoFinder().get_project_info(repo_url) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think there are 2 cases of failures we are testing in this function
- The Repository URL is given, and not valid
- The repository URL is available (from pytest-httpserver) but the result JSON is invalid
Could we have 2 separated test functions for each case, so that we can simplify the if conditions on the test inputs.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have separated the two test cases.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have finished my round of review. Thank you.
Signed-off-by: Ben Selwyn-Smith <[email protected]>
Signed-off-by: Ben Selwyn-Smith <[email protected]>
Signed-off-by: Ben Selwyn-Smith <[email protected]>
Summary
This PR adds discovery of PyPI attestation. URLs to these attestation files are sought via the deps.dev API.
Description of changes
DepsDevRepoFinder
was updated to use theDepsDevService
, ensuring consistent and easily configurable use of the APIDepsDevRepoFinder
functions (they were not added previously), including for the functions that PyPI attestation discovery relies upon.pypi-attestation
is used to extract information from the attestation certificate. This information is coerced into a predicate for use elsewhere within Macaron.ultralytics
Python library as its target.Related issues
Closes #947