add suricata files to collection #2558
Conversation
Hello @doomedraven, I'm Gemini Code Assist[^1]! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!
Summary of Changes
This pull request focuses on enhancing file collection and search functionality within the Cuckoo sandbox environment. Specifically, it aims to incorporate Suricata files into the file collection process and optimize hash-based file lookups. The changes modify how file paths are retrieved and managed, particularly for files identified by their hashes, and refactor the search logic to improve efficiency and accuracy.
Highlights
- Suricata File Collection: The PR adds support for collecting files identified by Suricata, integrating them into the existing file collection mechanisms.
- Optimized Hash Search: The pull request refactors the hash search functionality to improve performance and accuracy when locating files by their hash values, leveraging the `files` collection in MongoDB.
- Code Cleanup: The PR removes redundant code and simplifies file path retrieval logic, enhancing code maintainability.
Changelog
- dev_utils/mongo_hooks.py
  - Added `suricata` files to the list of file dictionaries to be collected from reports.
  - Added
- lib/cuckoo/common/web_utils.py
  - Added a blank line for code formatting consistency.
  - Removed unused code related to file reference searching and replaced it with a more efficient method using `_task_ids`.
- lib/cuckoo/core/database.py
  - Removed unused `folders` dictionary.
  - Modified the file path retrieval logic to use `_task_ids` from the `files` collection for hash-based searches.
  - Removed deprecated code related to Elasticsearch and older MongoDB queries.
  - Removed redundant checks and streamlined the file path construction process.
  - Removed code related to searching for Suricata files based on SHA256 hashes.
  - Removed unused
Using Gemini Code Assist
The full guide for Gemini Code Assist can be found on our documentation page; here are some quick tips.
Invoking Gemini
You can request assistance from Gemini at any point in your pull request by creating an issue comment (i.e. a comment on the pull request page) using either `/gemini <command>` or `@gemini-code-assist <command>`. Below is a summary of the supported commands.
| Feature | Command | Description |
|---|---|---|
| Code Review | `/gemini review` | Performs a code review for the current pull request in its current state. |
| Pull Request Summary | `/gemini summary` | Provides a summary of the current pull request in its current state. |
| Comment | `@gemini-code-assist` | Responds in comments when explicitly tagged, both in issue comments and review comments. |
| Help | `/gemini help` | Displays a list of available commands. |
Customization
To customize the Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a `.gemini/` folder in the base of the repository. Detailed instructions can be found here.
Limitations & Feedback
Gemini Code Assist is currently in preview and may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counterproductive. You can react with 👍 and 👎 on @gemini-code-assist comments to provide feedback.
You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.
A hash, a path, a file's quest,
To find its home, put code to the test.
Suricata's watch,
For files to match,
In Cuckoo's nest, put code to the test.
Footnotes

[^1]: Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.
Code Review
The pull request introduces changes to collect Suricata files and modifies the search functionality. The changes seem reasonable, but require careful review to ensure correctness and efficiency, especially regarding the database queries and file path constructions.
Summary of Findings
- Suricata file extraction path: The code currently iterates through 'files', 'procdump', and 'CAPE' categories to construct file paths. It would be beneficial to include 'suricata' as well to ensure all relevant files are checked.
- Deprecated code: The code contains a section marked as 'deprecated code'. It's important to either remove this code entirely or provide a clear justification for keeping it, along with a plan for its eventual removal.
- MongoDB query change: The MongoDB query in `sample_path_by_hash` has been modified to search the `files` collection instead of `analysis`. Verify that this change aligns with the intended data retrieval strategy and that the new query is properly optimized.
Merge Readiness
The pull request introduces significant changes to file handling and database queries. While the changes seem to address the intended functionality, there are several areas of concern that need to be addressed before merging. Specifically, the MongoDB query change and the Suricata file extraction path need to be carefully reviewed. I am unable to directly approve the pull request, and users should have others review and approve this code before merging. Due to the medium and high severity comments, I recommend that the pull request not be merged until those are addressed (at a minimum).
```diff
 if term in hash_searches:
-    # For analyses where files have been stored in the "files" collection, search
-    # there for the _id (i.e. sha256) of documents matching the given hash. As a
-    # special case, we don't need to do that query if the requested hash type is
-    # "sha256" since that's what's stored in the "file_refs" key.
-    # We do all this in addition to search the old keys for backwards-compatibility
-    # with documents that do not use this mechanism for storing file data.
-    if term == "sha256":
-        file_refs = [query_val]
-    else:
-        file_docs = mongo_find(FILES_COLL, {hash_searches[term]: query_val}, {"_id": 1})
-        file_refs = [doc["_id"] for doc in file_docs]
-    if file_refs:
-        if len(file_refs) > 1:
-            query = {"$in": file_refs}
-        else:
-            query = file_refs[0]
-        search_terms.extend([{f"{pfx}.{FILE_REF_KEY}": query} for pfx in NORMALIZED_FILE_FIELDS])
-        mongo_search_query = {"$or": search_terms}
+    # The file details are uniq, and we store 1 to many. So where hash type is uniq, IDs are list
+    file_docs = list(mongo_find(FILES_COLL, {hash_searches[term]: query_val}, {"_task_ids": 1}))
+    if not file_docs:
+        return []
+    ids = sorted(list(set(file_docs[0]["_task_ids"])), reverse=True)[:search_limit]
+    term = "ids"
+    mongo_search_query = {"info.id": {"$in": ids}}
```
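The new lookup path can be exercised in isolation. Below is a minimal, runnable sketch; the stubbed `mongo_find`, the canned document, and the hash value are assumptions for illustration, not CAPE's real helper or data:

```python
# Sketch of the new _task_ids-based hash search (stubbed, hypothetical data).
FILES_COLL = "files"

def mongo_find(collection, query, projection):
    # Stand-in for CAPE's mongo_find helper: returns canned "files" documents
    # carrying the IDs of the tasks in which this hash was seen.
    return [{"_task_ids": [42, 17, 42]}]

search_limit = 50
file_docs = list(mongo_find(FILES_COLL, {"md5": "0" * 32}, {"_task_ids": 1}))
if not file_docs:
    mongo_search_query = None
else:
    # Deduplicate, newest task first, capped at the search limit.
    ids = sorted(set(file_docs[0]["_task_ids"]), reverse=True)[:search_limit]
    mongo_search_query = {"info.id": {"$in": ids}}
print(mongo_search_query)  # {'info.id': {'$in': [42, 17]}}
```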
The comment indicates that file details are unique and stored one-to-many, suggesting that `_task_ids` should contain a list of task IDs. However, the code retrieves only the first element of `file_docs` (`file_docs[0]`). If the `mongo_find` query returns multiple `file_docs`, this logic will only consider the `_task_ids` from the first document. Consider iterating through all `file_docs` to collect all unique `_task_ids`:

```python
ids = []
for file_doc in file_docs:
    ids.extend(file_doc.get("_task_ids", []))
ids = sorted(list(set(ids)), reverse=True)[:search_limit]
```
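To see why the aggregation matters, here is a runnable toy comparison of the two approaches; the document shapes and task IDs are made up for illustration:

```python
def first_doc_ids(file_docs, search_limit):
    # Mirrors the PR as written: only the first matching document is considered.
    return sorted(set(file_docs[0]["_task_ids"]), reverse=True)[:search_limit]

def all_doc_ids(file_docs, search_limit):
    # Reviewer's suggestion: merge _task_ids across every matching document.
    ids = []
    for file_doc in file_docs:
        ids.extend(file_doc.get("_task_ids", []))
    return sorted(set(ids), reverse=True)[:search_limit]

docs = [{"_task_ids": [5, 9]}, {"_task_ids": [9, 2]}]
print(first_doc_ids(docs, 10))  # [9, 5] -- task 2 is silently dropped
print(all_doc_ids(docs, 10))    # [9, 5, 2]
```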
```diff
 if repconf.mongodb.enabled and web_conf.general.check_sample_in_mongodb:
     tasks = mongo_find(
-        "analysis",
-        {f"CAPE.payloads.{sizes_mongo.get(len(sample_hash), '')}": sample_hash},
-        {"CAPE.payloads": 1, "_id": 0, "info.id": 1},
+        "files",
+        {sizes_mongo.get(len(sample_hash), ""): sample_hash},
+        {"_info_ids": 1, "sha256": 1},
     )
```
The MongoDB query has been modified to search the `files` collection instead of `analysis`. Verify that this change aligns with the intended data retrieval strategy and that the new query is properly optimized for the `files` collection schema. Also, consider adding error handling in case the `mongo_find` function fails or returns unexpected results.
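One way to act on the error-handling note is a small guard around the lookup. This is a hedged sketch: `safe_find` is a hypothetical helper, not part of CAPE, and the lambdas stand in for the real `mongo_find`:

```python
def safe_find(find_fn, *args, default=None):
    # Hypothetical guard: normalize failures and odd results to a plain list.
    try:
        result = find_fn(*args)
        return list(result) if result is not None else (default if default is not None else [])
    except Exception:
        # In CAPE this would be reported via log.error(...) rather than swallowed.
        return default if default is not None else []

def boom(*_):
    raise RuntimeError("mongo is down")

print(safe_find(lambda coll, q: [{"sha256": "ab" * 32}], "files", {}))  # one doc
print(safe_find(boom, "files", {}))                                     # []
```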
```python
# ToDo suricata path - "suricata.files.file_info.path
for category in ("files", "procdump", "CAPE"):
    file_path = os.path.join(CUCKOO_ROOT, "storage", "analyses", str(id), category, task["sha256"])
    if path_exists(file_path):
        sample = [file_path]
        break
if sample:
    break
```
The code iterates through the 'files', 'procdump', and 'CAPE' categories to construct the file path. It would be beneficial to include 'suricata' as well to ensure all relevant files are checked.
```diff
-# ToDo suricata path - "suricata.files.file_info.path
-for category in ("files", "procdump", "CAPE"):
+for category in ("files", "procdump", "CAPE", "suricata"):
     file_path = os.path.join(CUCKOO_ROOT, "storage", "analyses", str(id), category, task["sha256"])
     if path_exists(file_path):
         sample = [file_path]
         break
-if sample:
-    break
```
No description provided.