Skip to content

Implement package assembly in scancode.io #485

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 12 commits into from
Aug 25, 2022
Merged

Conversation

JonoYang
Copy link
Member

@JonoYang JonoYang commented Aug 4, 2022

This PR updates the application package scanning step in scancode.io to reflect the updates made to package scanning in scancode-toolkit. Package scanning is now a two step process, where we first detect Package data from CodebaseResources, then we iterate through all the Resources with Package data and process them using the new Package handlers from scancode-toolkit.

@tdruez
Copy link
Contributor

tdruez commented Aug 5, 2022

@JonoYang thanks for splitting those PR, much easier to review and test.

  1. I'm getting the following error running the scan_codebase on https://files.pythonhosted.org/packages/9b/41/e1e7d6ecc3bc76681dfdc6b373566822bc2aab96fa3eceaaed70accc28b6/Django-4.1-py3-none-any.whl
DiscoveredPackage matching query does not exist.

Traceback:
  File "scancode.io/scanpipe/pipelines/__init__.py", line 115, in execute
    step(self)
  File "scancode.io/scanpipe/pipelines/scan_codebase.py", line 93, in scan_for_application_packages
    scancode.scan_for_application_packages(self.project)
  File "scancode.io/scanpipe/pipes/scancode.py", line 332, in scan_for_application_packages
    assemble_packages(project=project)
  File "scancode.io/scanpipe/pipes/scancode.py", line 375, in assemble_packages
    for item in items:
  File "scancode.io/lib/python3.9/site-packages/packagedcode/models.py", line 941, in assemble
    cls.assign_package_to_resources(
  File "scancode.io/lib/python3.9/site-packages/packagedcode/models.py", line 1005, in assign_package_to_resources
    package_adder(package_uid, resource, codebase)
  File "scancode.io/scanpipe/pipes/scancode.py", line 344, in add_to_package
    package = project.discoveredpackages.get(package_uid=package_uid)
  File "scancode.io/lib/python3.9/site-packages/django/db/models/manager.py", line 85, in manager_method
    return getattr(self.get_queryset(), name)(*args, **kwargs)
  File "scancode.io/lib/python3.9/site-packages/django/db/models/query.py", line 496, in get
    raise self.model.DoesNotExist(
  1. We should add a simple test for assemble_packages, to expose how a couple resources with package_data are assembled into a package.

@tdruez
Copy link
Contributor

tdruez commented Aug 8, 2022

@JonoYang Did you have the chance to investigate the issue reported above?

@JonoYang
Copy link
Member Author

JonoYang commented Aug 8, 2022

@tdruez The issue is caused by DatafileHandler.assemble_from_many(), where Resources are yielded before Packages. I've set up the CodebaseResource creation logic in load_inventory and scan_codebase to create DiscoveredPackages before the CodebaseResources of a Project, such that I can associate the CodebaseResources to their DiscoveredPackages. I've modified the .assemble() methods of other DatafileHandlers to yield Packages before dependencies and resources, but I'm also looking into sorting these objects on the scancode.io side in the assemble_packages function in my PR.

This is my PR on scancode-toolkit that modifies DatafileHandler.assemble_from_many() to yield Packages before Resources and Dependencies
aboutcode-org/scancode-toolkit#3042

@JonoYang
Copy link
Member Author

JonoYang commented Aug 9, 2022

Sorting the items yielded by .assemble() doesn't work. We still run into the issue where we need a DiscoveredPackage to exist for a Project before we can relate a CodebaseResource to it when the .assemble() method code is determining the resources of a package.

@tdruez
Copy link
Contributor

tdruez commented Aug 9, 2022

@JonoYang could you extract the changes related to #467 into it's own PR so we can get this one merge while we are sorting the assembly issues?

@JonoYang JonoYang force-pushed the implement-package-assembly branch 3 times, most recently from 901737b to 3da7a4e Compare August 10, 2022 01:06
@JonoYang
Copy link
Member Author

@tdruez

I've updated CodebaseResource.for_packages to return a list of package_uids of the DiscoveredPackages associated to the CodebaseResource. package_uid values are different every time they are generated, so in tests where we compare package_uid values, we normalize them using purl_with_fake_uuid. I was wondering what was the best way to normalize a package_uid from the CSV output test (https://github.com/nexB/scancode.io/blob/implement-package-assembly/scanpipe/tests/test_pipes.py#L132)

@JonoYang JonoYang force-pushed the implement-package-assembly branch from 3da7a4e to 6bc7a2b Compare August 11, 2022 23:42
@JonoYang
Copy link
Member Author

I've updated CodebaseResource.for_packages to return a list of package_uids of the DiscoveredPackages associated to the CodebaseResource. package_uid values are different every time they are generated, so in tests where we compare package_uid values, we normalize them using purl_with_fake_uuid. I was wondering what was the best way to normalize a package_uid from the CSV output test (https://github.com/nexB/scancode.io/blob/implement-package-assembly/scanpipe/tests/test_pipes.py#L132)

Looking at it again, I now realize there is no need to normalize the package_uids in the for_packages field in the CSV tests because the Package data is set in code and in the fixtures and we will always get the same package_uid returned. In scancode-toolkit, we need to normalize the package_uids because scancode generates a unique package_uid on each test run.

@JonoYang JonoYang force-pushed the implement-package-assembly branch 4 times, most recently from a5f926d to db94a74 Compare August 12, 2022 17:54
@tdruez
Copy link
Contributor

tdruez commented Aug 15, 2022

@JonoYang I'm not sure your latest changes were intended to fix #485 (comment) but I'm still getting the same issue with the latest code.

@JonoYang
Copy link
Member Author

@JonoYang I'm not sure your latest changes were intended to fix #485 (comment) but I'm still getting the same issue with the latest code.

The PR with this fix has been merged into scancode-toolkit and should be available in the next version and I'll bump the scancode version in scio when it comes out. (aboutcode-org/scancode-toolkit#3042)

@JonoYang JonoYang force-pushed the implement-package-assembly branch from db94a74 to bca9ecb Compare August 15, 2022 18:05
@tdruez
Copy link
Contributor

tdruez commented Aug 18, 2022

@JonoYang I've merged the latest main branch including the ScanCode-toolkit v31.0.1 version.

  1. The issue "DiscoveredPackage matching query does not exist." from Implement package assembly in scancode.io #485 (comment) appears to be fixed.

I'm seeing an extra resource associated to the second package using the main branch that I do not see in this current branch:

  • main
    Screenshot 2022-08-18 at 15 15 14

  • implement-package-assembly
    Screenshot 2022-08-18 at 15 16 03

Also, shouldn't we only detect 1 package here?


  1. I'm still getting this issue in the assemble_packages() with this branch running the docker pipeline on docker://python:3.8-slim-buster:
CodebaseResource matching query does not exist.

Traceback:
  File "/app/scanpipe/pipelines/__init__.py", line 115, in execute
    step(self)
  File "/app/scanpipe/pipelines/root_filesystems.py", line 118, in scan_for_application_packages
    scancode.scan_for_application_packages(self.project)
  File "/app/scanpipe/pipes/scancode.py", line 332, in scan_for_application_packages
    assemble_packages(project=project)
  File "/app/scanpipe/pipes/scancode.py", line 375, in assemble_packages
    for item in items:
  File "/usr/local/lib/python3.9/site-packages/packagedcode/models.py", line 949, in assemble
    cls.assign_package_to_resources(
  File "/usr/local/lib/python3.9/site-packages/packagedcode/pypi.py", line 319, in assign_package_to_resources
    site_packages = resource.parent(codebase).parent(codebase).parent(codebase)
  File "/app/scanpipe/models.py", line 1632, in parent
    return parent_path and self.project.codebaseresources.get(path=parent_path)
  File "/usr/local/lib/python3.9/site-packages/django/db/models/manager.py", line 85, in manager_method
    return getattr(self.get_queryset(), name)(*args, **kwargs)
  File "/usr/local/lib/python3.9/site-packages/django/db/models/query.py", line 650, in get
    raise self.model.DoesNotExist(

@JonoYang
Copy link
Member Author

@tdruez

CodebaseResource matching query does not exist.

Traceback:
File "/app/scanpipe/pipelines/init.py", line 115, in execute
step(self)
File "/app/scanpipe/pipelines/root_filesystems.py", line 118, in scan_for_application_packages
scancode.scan_for_application_packages(self.project)
File "/app/scanpipe/pipes/scancode.py", line 332, in scan_for_application_packages
assemble_packages(project=project)
File "/app/scanpipe/pipes/scancode.py", line 375, in assemble_packages
for item in items:
File "/usr/local/lib/python3.9/site-packages/packagedcode/models.py", line 949, in assemble
cls.assign_package_to_resources(
File "/usr/local/lib/python3.9/site-packages/packagedcode/pypi.py", line 319, in assign_package_to_resources
site_packages = resource.parent(codebase).parent(codebase).parent(codebase)
File "/app/scanpipe/models.py", line 1632, in parent
return parent_path and self.project.codebaseresources.get(path=parent_path)
File "/usr/local/lib/python3.9/site-packages/django/db/models/manager.py", line 85, in manager_method
return getattr(self.get_queryset(), name)(*args, **kwargs)
File "/usr/local/lib/python3.9/site-packages/django/db/models/query.py", line 650, in get
raise self.model.DoesNotExist(

This error is happening because we do not create the directory Resources when we create Resources from a docker image. The fix for this is to set with_dir=True as an argument into layer.get_resources() (https://github.com/nexB/scancode.io/blob/main/scanpipe/pipes/docker.py#L176)

After fixing that issue, the pipeline failed from a syntax error in PythonInstalledWheelMetadataFile.assign_package_to_resources(). I've created a PR for it (aboutcode-org/scancode-toolkit#3059)

After fixing the syntax error in PythonInstalledWheelMetadataFile.assign_package_to_resources(), I'm able to run the docker pipeline without errors.
image

@JonoYang
Copy link
Member Author

JonoYang commented Aug 18, 2022

I'm seeing an extra resource associated to the second package using the main branch that I do not see in this current branch:

main
Screenshot 2022-08-18 at 15 15 14

implement-package-assembly
Screenshot 2022-08-18 at 15 16 03

Also, shouldn't we only detect 1 package here?

With the current logic we have in packagedcode, there should be two Packages detected: 1 from the wheel itself and 1 from the extracted METADATA file. We would have 1 detected package if the scan_codebase pipeline did not have an extraction step.

INFO Project test8 assemble_packages:
INFO   Processing: Django-4.1-py3-none-any.whl
INFO   Package data: pkg:pypi/[email protected]
INFO   Selected package handler: PypiWheelHandler
INFO     Processing item: Package(type='pypi', namespace=None, name='Django', version='4.1', datasource_id='pypi_wheel')
INFO     Processing item: Dependency(purl='pkg:pypi/asgiref', extracted_requirement='asgiref<4,>=3.5.2', scope='install')
INFO     Processing item: Dependency(purl='pkg:pypi/sqlparse', extracted_requirement='sqlparse>=0.2.2', scope='install')
INFO     Processing item: Dependency(purl='pkg:pypi/backports-zoneinfo', extracted_requirement='backports.zoneinfo; python_version < "3.9"', scope='install')
INFO     Processing item: Dependency(purl='pkg:pypi/tzdata', extracted_requirement='tzdata; sys_platform == "win32"', scope='install')
INFO     Processing item: Dependency(purl='pkg:pypi/argon2-cffi', extracted_requirement='argon2-cffi>=19.1.0; extra == "argon2"', scope='argon2')
INFO     Processing item: Dependency(purl='pkg:pypi/bcrypt', extracted_requirement='bcrypt; extra == "bcrypt"', scope='bcrypt')
INFO     Processing item: Django-4.1-py3-none-any.whl
INFO   Processing: Django-4.1-py3-none-any.whl-extract/Django-4.1.dist-info/METADATA
INFO   Package data: pkg:pypi/[email protected]
INFO   Selected package handler: PythonInstalledWheelMetadataFile
INFO     Processing item: Package(type='pypi', namespace=None, name='Django', version='4.1', datasource_id='pypi_wheel_metadata')
PythonInstalledWheelMetadataFile: assign_packages_to_resources: Django-4.1-py3-none-any.whl-extract/Django-4.1.dist-info/METADATA
INFO     Processing item: Dependency(purl='pkg:pypi/asgiref', extracted_requirement='asgiref<4,>=3.5.2', scope='install')
INFO     Processing item: Dependency(purl='pkg:pypi/sqlparse', extracted_requirement='sqlparse>=0.2.2', scope='install')
INFO     Processing item: Dependency(purl='pkg:pypi/backports-zoneinfo', extracted_requirement='backports.zoneinfo; python_version < "3.9"', scope='install')
INFO     Processing item: Dependency(purl='pkg:pypi/tzdata', extracted_requirement='tzdata; sys_platform == "win32"', scope='install')
INFO     Processing item: Dependency(purl='pkg:pypi/argon2-cffi', extracted_requirement='argon2-cffi>=19.1.0; extra == "argon2"', scope='argon2')
INFO     Processing item: Dependency(purl='pkg:pypi/bcrypt', extracted_requirement='bcrypt; extra == "bcrypt"', scope='bcrypt')
INFO     Processing item: Django-4.1-py3-none-any.whl-extract/Django-4.1.dist-info/METADATA

I guess we could be a bit more clever by update the package assembly code upstream to not process METADATA or other Package manifest files for an extracted wheel.

Regarding the extra resource associated with the second package, I may have overlooked assigning a resource to a package somewhere. I will have to check in depth.

This is due to PythonInstalledWheelMetadataFile expecting the Resource Django-4.1-py3-none-any.whl-extract/Django-4.1.dist-info/METADATA to be in a site-packages directory when it is not. (https://github.com/nexB/scancode-toolkit/blob/develop/src/packagedcode/pypi.py#L320)

@JonoYang JonoYang force-pushed the implement-package-assembly branch from a56250c to aef3e07 Compare August 23, 2022 22:13
JonoYang added a commit that referenced this pull request Aug 24, 2022
    * Update test expectations

Signed-off-by: Jono Yang <[email protected]>
This reverts commit c9b8bed.

Sorting Packages, Dependencies, and Resources from DatafileHandler.assemble() will never work. The code needs to be changed in scancode-toolkit.

Signed-off-by: Jono Yang <[email protected]>
Signed-off-by: Jono Yang <[email protected]>
    * This is so we are consistent with scancode-toolkit JSON output
    * Update expected test results

Signed-off-by: Jono Yang <[email protected]>
    * Update test expectations

Signed-off-by: Jono Yang <[email protected]>
@JonoYang JonoYang force-pushed the implement-package-assembly branch from 13da352 to ef1e853 Compare August 25, 2022 00:06
JonoYang added a commit that referenced this pull request Aug 25, 2022
@JonoYang JonoYang force-pushed the implement-package-assembly branch from ef1e853 to bb6757d Compare August 25, 2022 01:22
@tdruez
Copy link
Contributor

tdruez commented Aug 25, 2022

@JonoYang with the latest toolkit 31.0.2 the DoesNotExist is now properly fixed.
I've encountered a new ProjectError reported aboutcode-org/scancode-toolkit#3067 but unrelated to this PR.
Merging!

@tdruez tdruez merged commit a69c563 into main Aug 25, 2022
@tdruez tdruez deleted the implement-package-assembly branch August 25, 2022 08:07
JonoYang added a commit that referenced this pull request Aug 25, 2022
    * Update test expectations

Signed-off-by: Jono Yang <[email protected]>
tdruez added a commit that referenced this pull request Aug 31, 2022
* Implement package assembly in scancode.io #447

Signed-off-by: Jono Yang <[email protected]>

* Minor formatting changes for consistency #447

Signed-off-by: Thomas Druez <[email protected]>

* Create DiscoveredPackages before other models #447

Signed-off-by: Jono Yang <[email protected]>

* Revert "Create DiscoveredPackages before other models #447"

This reverts commit c9b8bed.

Sorting Packages, Dependencies, and Resources from DatafileHandler.assemble() will never work. The code needs to be changed in scancode-toolkit.

Signed-off-by: Jono Yang <[email protected]>

* Update migration #444

Signed-off-by: Jono Yang <[email protected]>

* Return package_uids in for_packages #444

    * This is so we are consistent with scancode-toolkit JSON output
    * Update expected test results

Signed-off-by: Jono Yang <[email protected]>

* Create directory Resources in docker pipeline #485

    * Update test expectations

Signed-off-by: Jono Yang <[email protected]>

* Implement package assembly in scancode.io #447

Signed-off-by: Jono Yang <[email protected]>

* Implement package assembly in scancode.io #447

Signed-off-by: Jono Yang <[email protected]>

* Create DiscoveredDependency model #447

    * Create new dependency list and detail views
    * Update assemble_packages() to create DiscoveredDependencies
    * Update test expectations

Signed-off-by: Jono Yang <[email protected]>

* Update fields on DiscoveredDependency #447

    * Remove for_package_uid and replace with ForeignKey for_package
    * Remove datafile_path and replace with ForeignKey datafile_resource
    * Create properties for the two removed fields
    * Update dependency views to link to datafile_resource
    * Update expected test results

Signed-off-by: Jono Yang <[email protected]>

* Properly pluralize verbose name #447

Signed-off-by: Jono Yang <[email protected]>

* Create new argument for create_from_data #447

    * Add strip_datafile_path_root to DiscoveredDependency.create_from_data
    * This argument strips the root path segment from `datafile_path` before using the path to look up the corresponding CodebaseResource
    * This is used in the case where we are importing a scan from scancode-toolkit, where the root path segments are not stripped by default
    * Update expected test results

Signed-off-by: Jono Yang <[email protected]>

* Update prefetch_related #447

    * Used cached_property for DiscoveredDependency properties

Signed-off-by: Jono Yang <[email protected]>

* Prefetch related models in output code #447

Signed-off-by: Jono Yang <[email protected]>

* Import scancode.io 30.2.0 scans in load_codebase

    * Order DiscoveredDependencies by is_runtime, is_optional, is_resolved, and dependency_uid
    * Do not show dependency_uid value in DiscoveredDependency list view

Signed-off-by: Jono Yang <[email protected]>

* Revert changes for importing old scancode.io scans

Signed-off-by: Jono Yang <[email protected]>

* Regen migrations for DiscoveredDependency #447

Signed-off-by: Jono Yang <[email protected]>

* Migrate DiscoveredPackage.dependencies #447

    * Create migrations to generate new DiscoveredDependency objects from DiscoveredPackage.dependencies before removing the dependencies field

Signed-off-by: Jono Yang <[email protected]>

* Update test expectations #447

Signed-off-by: Jono Yang <[email protected]>

* Remove accidentally committed files #447

Signed-off-by: Jono Yang <[email protected]>

* Update migration logic #447

    * Remove unnecessary else from DiscoveredDependency properties

Signed-off-by: Jono Yang <[email protected]>

* Add PackageURLMixin to DiscoveredDependency #447

Signed-off-by: Jono Yang <[email protected]>

* Set DiscoveredDependencies purl fields #447

    * Create migration that populates purl fields for existing DiscoveredDependencies

Signed-off-by: Jono Yang <[email protected]>

* Store purl values in purl fields #447

    * Do not store dependency_uid in purl fields

Signed-off-by: Jono Yang <[email protected]>

* Remove purl field from DiscoveredDependency #447

    * We are already storing this info in the purl fields
    * Create purl property on DiscoveredDependency for compatibility

Signed-off-by: Jono Yang <[email protected]>

* Update DependencyFilterSet #447

    * Add search and purl fields

Signed-off-by: Jono Yang <[email protected]>

* Don't show DiscoveredDependencies purl fields #447

Signed-off-by: Jono Yang <[email protected]>

* Update package detail view dependencies tab #447

Signed-off-by: Jono Yang <[email protected]>

* Add package_type to dependency serializer #511

    * Update test expectations

Signed-off-by: Jono Yang <[email protected]>

* Update expected test results

Signed-off-by: Jono Yang <[email protected]>

* Add dependency table column #447

Signed-off-by: Jono Yang <[email protected]>

* Use tabset in dependency detail view #447

    * Add package_type property to DiscoveredDependency

Signed-off-by: Jono Yang <[email protected]>

* Update dependency list view #447

    * Use updated table header include
    * Update dependency presentation in package detail view
    * Show package uid on hover on for package tab

Signed-off-by: Jono Yang <[email protected]>

* Set DiscoveredDependency serializer fields #511

    * Update DiscoveredDependency ordering

Signed-off-by: Jono Yang <[email protected]>

* Create donut chart for package type #447

Signed-off-by: Jono Yang <[email protected]>

* Consolidate migrations #447

    * Update DiscoveredDependency ordering
    * Update daglib test expectations

Signed-off-by: Jono Yang <[email protected]>

* Update dependency JSON ordering #447

    * Update test expectations

Signed-off-by: Jono Yang <[email protected]>

* Set proper discovereddependencies related_name #447

Signed-off-by: Thomas Druez <[email protected]>

* Fix template indentation #447

Signed-off-by: Thomas Druez <[email protected]>

* Refactor update_from_data method into a UpdateFromDataMixin #447

Signed-off-by: Thomas Druez <[email protected]>

* Fix the ProjectSerializer fields #447

Signed-off-by: Thomas Druez <[email protected]>

* Fix test_scanpipe_api_project_detail unit test #447

Signed-off-by: Thomas Druez <[email protected]>

* Add HTML title for list views #506

Signed-off-by: Thomas Druez <[email protected]>

* Update dependency tabs #447

    * Only show links in dependency for_package tab or dependency datafile_resource tab if there is a value

Signed-off-by: Jono Yang <[email protected]>

* Use UpdateFromDataMixin #447

    * Use UpdateFromDataMixin in DiscoveredDependency
    * Create test for DiscoveredDependency.update_from_data()

Signed-off-by: Jono Yang <[email protected]>

* Fix formatting #447

Signed-off-by: Thomas Druez <[email protected]>

Signed-off-by: Jono Yang <[email protected]>
Signed-off-by: Thomas Druez <[email protected]>
Co-authored-by: Thomas Druez <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants