
Map archives when their extracted directory mapped/processed #827


Open

pombredanne opened this issue Jul 27, 2023 · 2 comments



pombredanne commented Jul 27, 2023

In a deploy_to_develop pipeline, when I have an archive like "foo.zip", there will be a directory "foo.zip-extract" with the extracted content.

  • If "foo.zip" is matched to the PurlDB, then "foo.zip-extract" should be treated as matched too. This is already the case.

  • Otherwise, if the archive is matched neither to the devel side nor to the PurlDB, "foo.zip" should be assigned an "extracted" status at the end of the step that matches archives to the PurlDB, and it should not be processed further. This works because its extracted content is processed otherwise.
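The status decision described above could be sketched as a small helper like the following (a minimal sketch with hypothetical names, not the actual pipeline code):

```python
def archive_status(matched_to_purldb: bool, matched_to_devel: bool, has_extract_dir: bool) -> str:
    """Return the status to assign to an archive resource.

    - Matched to the PurlDB: the archive is "matched" (its -extract
      directory is treated as matched too).
    - Otherwise, if its content was extracted locally and it is not
      mapped on the devel side, flag it "extracted" so it is not
      processed further; the extracted content is processed instead.
    - Otherwise, leave the status empty for the remaining steps.
    """
    if matched_to_purldb:
        return "matched"
    if not matched_to_devel and has_extract_dir:
        return "extracted"
    return ""
```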

We need to validate that this is the actual behavior.


tdruez commented Jul 31, 2023

@pombredanne Some notes from looking at the data on a ScanCode.io instance:

  1. All extracted resources are properly flagged as is_archive.
  2. NOT all resources flagged as is_archive are extracted.

Resources with the following extensions are flagged as is_archive but not extracted by extractcode:

- .a
- .img
- .aifc
- .lz4
- .xz
- .exe
- .deb
- .beam
- .sym
- .swf
- .egg
- .svgz
- .res

Keeping the scripts to revisit this later:

from scanpipe.models import *

# We want to exclude failed runs and runs without actual local extraction
run_qs = Run.objects.succeed().filter(pipeline_name__in=["deploy_to_develop", "docker", "scan_codebase"])
project_qs = Project.objects.filter(runs__in=run_qs)
resource_qs = CodebaseResource.objects.filter(project__in=project_qs)

extracted_root_dirs = resource_qs.filter(name__endswith="-extract")
extracted_root_dirs.count()
# 9,273
is_archive_resources = resource_qs.filter(is_archive=True)
is_archive_resources.count()
# 19,185

archives_paths = [
    extracted_dir_resource.path.removesuffix("-extract")
    for extracted_dir_resource in extracted_root_dirs
]

archives = resource_qs.filter(path__in=archives_paths)
print([resource.path for resource in archives if not resource.is_archive])
# [] -> 1. All extracted resources are properly flagged as `is_archive`

# -> 2. Let's see if all resources flagged as `is_archive` are extracted?

unextracted_extensions = set()
for resource in is_archive_resources:
    try:
        _ = resource_qs.get(
            project=resource.project,
            path=f"{resource.path}-extract",
        )
    except CodebaseResource.DoesNotExist:
        unextracted_extensions.add(resource.extension)

# -> unextracted_extensions = {'.tar', '.a', '.img', '.aifc', '.lz4', '.xz', '.exe', '.deb', '.beam', '.sym', '.bz2', '.swf', '.egg', '.svgz', '.res'}

# Let's make sure of it (we do not want to keep extraction errors here)
never_extracted_extensions = set()
for extension in unextracted_extensions:
    if not extension:  # skip the empty "" string case
        continue
    if not resource_qs.filter(path__contains=f"{extension}-extract").exists():
        never_extracted_extensions.add(extension)
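The suffix logic in the loops above can be illustrated database-free on plain path lists (a sketch to show the computation; the function name is made up):

```python
import os

def unextracted_extensions(resource_paths, is_archive_paths):
    """Extensions of archives that have no "<path>-extract" sibling.

    Mirrors the ORM loop above: for each archive path, look for a
    resource at "<path>-extract"; when none exists, record the extension.
    """
    all_paths = set(resource_paths)
    missing = set()
    for path in is_archive_paths:
        if f"{path}-extract" not in all_paths:
            extension = os.path.splitext(path)[1]
            if extension:  # skip the empty "" string case
                missing.add(extension)
    return missing
```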

AyanSinhaMahapatra added a commit that referenced this issue Aug 15, 2023
Reference: #827
Signed-off-by: Ayan Sinha Mahapatra <[email protected]>
tdruez pushed a commit that referenced this issue Aug 16, 2023
* Assign status to processed archives

Reference: #827
Signed-off-by: Ayan Sinha Mahapatra <[email protected]>

* Add test to check archives are assigned status

Signed-off-by: Ayan Sinha Mahapatra <[email protected]>

* Fix test failures

Signed-off-by: Ayan Sinha Mahapatra <[email protected]>

* Address review comments and add CHANGELOG entry

Signed-off-by: Ayan Sinha Mahapatra <[email protected]>

@pombredanne
Member Author

See also:

@JonoYang wrote:

In #485, we have an issue where we get two DiscoveredPackages for the same package when we scan a PyPI wheel using the scan_codebase pipeline. This happens because we report a Package detected from the wheel itself, and then we create another Package from the METADATA file extracted from the wheel. A way to avoid this would be for ScanCode.io to know where archives were extracted to. This way, if we detect that an archive is a Package, we can easily tag its extracted contents as being part of that package. Alternatively, if we detect that an extracted archive is a package itself, we can easily tag the archive as part of the package.

It would help to be able to navigate from a resource to its extracted directory, at least in the UI (and possibly also in the DB and API).
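Based on the "-extract" naming convention used by the script earlier in this thread, such navigation could be sketched as two path helpers (hypothetical names, not an existing API):

```python
def extract_dir_path(archive_path: str) -> str:
    """Path of the directory holding an archive's extracted content."""
    return f"{archive_path}-extract"

def archive_path(extract_dir: str):
    """Path of the archive an "-extract" directory came from, else None."""
    if extract_dir.endswith("-extract"):
        return extract_dir.removesuffix("-extract")
    return None
```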
