
Map archives when their extracted directory mapped/processed #827


Open

pombredanne opened this issue Jul 27, 2023 · 2 comments



pombredanne commented Jul 27, 2023

In a deploy_to_develop pipeline, when I have an archive like "foo.zip", there will be a directory "foo.zip-extract" with the extracted content.

  • If "foo.zip" is matched to the PurlDB, then "foo.zip-extract" should be treated as matched too. This is already the case.

  • Otherwise, if the archive is matched neither to the devel side nor to the PurlDB, "foo.zip" should be assigned an "extracted" status at the end of the step that matches archives to the PurlDB, and it should not be processed further. This works because its extracted content is processed otherwise.
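The status decision described above could be sketched as a small helper like the following (a minimal sketch with hypothetical names, not the actual pipeline code):

```python
def archive_status(matched_to_purldb: bool, matched_to_devel: bool, has_extract_dir: bool) -> str:
    """Return the status to assign to an archive resource.

    - Matched to the PurlDB: the archive is "matched" (its -extract
      directory is treated as matched too).
    - Otherwise, if its content was extracted locally and it is not
      mapped on the devel side, flag it "extracted" so it is not
      processed further; the extracted content is processed instead.
    - Otherwise, leave the status empty for the remaining steps.
    """
    if matched_to_purldb:
        return "matched"
    if not matched_to_devel and has_extract_dir:
        return "extracted"
    return ""
```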

We need to validate that this is the actual behavior.


tdruez commented Jul 31, 2023

@pombredanne Some notes from looking at the data on a ScanCode.io instance:

  1. All extracted resources are properly flagged as is_archive.
  2. NOT all resources flagged as is_archive are extracted.

Resources with the following extensions are flagged as is_archive but not extracted by extractcode:

- .a
- .img
- .aifc
- .lz4
- .xz
- .exe
- .deb
- .beam
- .sym
- .swf
- .egg
- .svgz
- .res

Keeping the scripts to revisit this later:

from scanpipe.models import *

# We want to exclude failed runs and runs without actual local extraction
run_qs = Run.objects.succeed().filter(pipeline_name__in=["deploy_to_develop", "docker", "scan_codebase"])
project_qs = Project.objects.filter(runs__in=run_qs)
resource_qs = CodebaseResource.objects.filter(project__in=project_qs)

extracted_root_dirs = resource_qs.filter(name__endswith="-extract")
extracted_root_dirs.count()
# 9,273
is_archive_resources = resource_qs.filter(is_archive=True)
is_archive_resources.count()
# 19,185

archives_paths = [
    extracted_dir_resource.path.removesuffix("-extract")
    for extracted_dir_resource in extracted_root_dirs
]

archives = resource_qs.filter(path__in=archives_paths)
print([resource.path for resource in archives if not resource.is_archive])
# [] -> 1. All extracted resources are properly flagged as `is_archive`

# -> 2. Let's see if all resources flagged as `is_archive` are extracted?

unextracted_extensions = set()
for resource in is_archive_resources:
    try:
        _ = resource_qs.get(
            project=resource.project,
            path=f"{resource.path}-extract",
        )
    except CodebaseResource.DoesNotExist:
        unextracted_extensions.add(resource.extension)

# -> unextracted_extensions = {'.tar', '.a', '.img', '.aifc', '.lz4', '.xz', '.exe', '.deb', '.beam', '.sym', '.bz2', '.swf', '.egg', '.svgz', '.res'}

# Let's make sure of it (we do not want to keep extraction errors here)
never_extracted_extensions = set()
for extension in unextracted_extensions:
    if not extension:  # skip the empty "" string case
        continue
    if not resource_qs.filter(path__contains=f"{extension}-extract").exists():
        never_extracted_extensions.add(extension)
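The suffix logic in the loops above can be illustrated database-free on plain path lists (a sketch to show the computation; the function name is made up):

```python
import os

def unextracted_extensions(resource_paths, is_archive_paths):
    """Extensions of archives that have no "<path>-extract" sibling.

    Mirrors the ORM loop above: for each archive path, look for a
    resource at "<path>-extract"; when none exists, record the extension.
    """
    all_paths = set(resource_paths)
    missing = set()
    for path in is_archive_paths:
        if f"{path}-extract" not in all_paths:
            extension = os.path.splitext(path)[1]
            if extension:  # skip the empty "" string case
                missing.add(extension)
    return missing
```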

AyanSinhaMahapatra added a commit that referenced this issue Aug 15, 2023
Reference: #827
Signed-off-by: Ayan Sinha Mahapatra <[email protected]>
tdruez pushed a commit that referenced this issue Aug 16, 2023
* Assign status to processed archives

Reference: #827
Signed-off-by: Ayan Sinha Mahapatra <[email protected]>

* Add test to check archives are assigned status

Signed-off-by: Ayan Sinha Mahapatra <[email protected]>

* Fix test failures

Signed-off-by: Ayan Sinha Mahapatra <[email protected]>

* Address review comments and add CHANGELOG entry

Signed-off-by: Ayan Sinha Mahapatra <[email protected]>

@pombredanne
Member Author

See also:

@JonoYang wrote:

In #485, we have an issue where we get two DiscoveredPackages for the same package when we scan a PyPI wheel using the scan_codebase pipeline. This happens because we report a Package detected from the wheel itself, and then we create another Package from the METADATA file extracted from the wheel. A way to avoid this would be for ScanCode.io to know where archives were extracted to. This way, if we detect that an archive is a Package, we can easily tag its extracted contents as being part of that package. Alternatively, if we detect that an extracted archive is a package itself, we can easily tag the archive as part of the package.

It would help to be able to navigate from a resource to its extracted directory, at least in the UI (and possibly also in the DB and API).
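Based on the "-extract" naming convention used by the script earlier in this thread, such navigation could be sketched as two path helpers (hypothetical names, not an existing API):

```python
def extract_dir_path(archive_path: str) -> str:
    """Path of the directory holding an archive's extracted content."""
    return f"{archive_path}-extract"

def archive_path(extract_dir: str):
    """Path of the archive an "-extract" directory came from, else None."""
    if extract_dir.endswith("-extract"):
        return extract_dir.removesuffix("-extract")
    return None
```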
