Skip to content

add suricata files to collection #2558

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 4 commits into from
Apr 11, 2025
Merged

add suricata files to collection #2558

merged 4 commits into from
Apr 11, 2025

Conversation

doomedraven
Copy link
Collaborator

No description provided.

Copy link
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hello @doomedraven, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

Summary of Changes

This pull request focuses on enhancing file collection and search functionalities within the Cuckoo sandbox environment. Specifically, it aims to incorporate Suricata files into the file collection process and optimize hash-based file lookups. The changes modify how file paths are retrieved and managed, particularly for files identified by their hashes, and refactors the search logic to improve efficiency and accuracy.

Highlights

  • Suricata File Collection: The PR adds support for collecting files identified by Suricata, integrating them into the existing file collection mechanisms.
  • Optimized Hash Search: The pull request refactors the hash search functionality to improve performance and accuracy when locating files by their hash values, leveraging the files collection in MongoDB.
  • Code Cleanup: The PR removes redundant code and simplifies file path retrieval logic, enhancing code maintainability.

Changelog

Click here to see the changelog
  • dev_utils/mongo_hooks.py
    • Added suricata files to the list of file dictionaries to be collected from reports.
  • lib/cuckoo/common/web_utils.py
    • Added a blank line for code formatting consistency.
    • Removed unused code related to file reference searching and replaced it with a more efficient method using _task_ids.
  • lib/cuckoo/core/database.py
    • Removed unused folders dictionary.
    • Modified the file path retrieval logic to use _task_ids from the files collection for hash-based searches.
    • Removed deprecated code related to Elasticsearch and older MongoDB queries.
    • Removed redundant checks and streamlined the file path construction process.
    • Removed code related to searching for Suricata files based on SHA256 hashes.
Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point in your pull request via creating an issue comment (i.e. comment on the pull request page) using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands.

Feature Command Description
Code Review /gemini review Performs a code review for the current pull request in its current state.
Pull Request Summary /gemini summary Provides a summary of the current pull request in its current state.
Comment @gemini-code-assist Responds in comments when explicitly tagged, both in issue comments and review comments.
Help /gemini help Displays a list of available commands.

Customization

To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist is currently in preview and may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments to provide feedback.

You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.


A hash, a path, a file's quest,
To find its home, put code to the test.
Suricata's watch,
For files to match,
In Cuckoo's nest, put code to the test.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

Copy link
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

The pull request introduces changes to collect Suricata files and modifies the search functionality. The changes seem reasonable, but require careful review to ensure correctness and efficiency, especially regarding the database queries and file path constructions.

Summary of Findings

  • Suricata file extraction path: The code currently iterates through 'files', 'procdump', and 'CAPE' categories to construct file paths. It would be beneficial to include 'suricata' as well to ensure all relevant files are checked.
  • Deprecated code: The code contains a section marked as 'deprecated code'. It's important to either remove this code entirely or provide a clear justification for keeping it, along with a plan for its eventual removal.
  • MongoDB query change: The MongoDB query in sample_path_by_hash has been modified to search the files collection instead of analysis. Verify that this change aligns with the intended data retrieval strategy and that the new query is properly optimized.

Merge Readiness

The pull request introduces significant changes to file handling and database queries. While the changes seem to address the intended functionality, there are several areas of concern that need to be addressed before merging. Specifically, the MongoDB query change and the Suricata file extraction path need to be carefully reviewed. I am unable to directly approve the pull request, and users should have others review and approve this code before merging. Due to the medium and high severity comments, I recommend that the pull request not be merged until those are addressed (at a minimum).

Comment on lines 1425 to +1432
if term in hash_searches:
# For analyses where files have been stored in the "files" collection, search
# there for the _id (i.e. sha256) of documents matching the given hash. As a
# special case, we don't need to do that query if the requested hash type is
# "sha256" since that's what's stored in the "file_refs" key.
# We do all this in addition to search the old keys for backwards-compatibility
# with documents that do not use this mechanism for storing file data.
if term == "sha256":
file_refs = [query_val]
else:
file_docs = mongo_find(FILES_COLL, {hash_searches[term]: query_val}, {"_id": 1})
file_refs = [doc["_id"] for doc in file_docs]
if file_refs:
if len(file_refs) > 1:
query = {"$in": file_refs}
else:
query = file_refs[0]
search_terms.extend([{f"{pfx}.{FILE_REF_KEY}": query} for pfx in NORMALIZED_FILE_FIELDS])
mongo_search_query = {"$or": search_terms}
# The file details are uniq, and we store 1 to many. So where hash type is uniq, IDs are list
file_docs = list(mongo_find(FILES_COLL, {hash_searches[term]: query_val}, {"_task_ids": 1}))
if not file_docs:
return []
ids = sorted(list(set(file_docs[0]["_task_ids"])), reverse=True)[:search_limit]
term = "ids"
mongo_search_query = {"info.id": {"$in": ids}}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

The comment indicates that file details are unique and stored one-to-many, suggesting that _task_ids should contain a list of task IDs. However, the code retrieves only the first element of file_docs (file_docs[0]). If there are multiple file_docs returned by the mongo_find query, this logic will only consider the _task_ids from the first document. Consider iterating through all file_docs to collect all unique _task_ids.

                ids = []
                for file_doc in file_docs:
                    ids.extend(file_doc.get("_task_ids", []))
                ids = sorted(list(set(ids)), reverse=True)[:search_limit]

Comment on lines +2350 to 2355
if repconf.mongodb.enabled and web_conf.general.check_sample_in_mongodb:
tasks = mongo_find(
"analysis",
{f"CAPE.payloads.{sizes_mongo.get(len(sample_hash), '')}": sample_hash},
{"CAPE.payloads": 1, "_id": 0, "info.id": 1},
"files",
{sizes_mongo.get(len(sample_hash), ""): sample_hash},
{"_info_ids": 1, "sha256": 1},
)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

The MongoDB query has been modified to search the files collection instead of analysis. Verify that this change aligns with the intended data retrieval strategy and that the new query is properly optimized for the files collection schema. Also, consider adding error handling in case the mongo_find function fails or returns unexpected results.

Comment on lines +2370 to 2377
# ToDo suricata path - "suricata.files.file_info.path
for category in ("files", "procdump", "CAPE"):
file_path = os.path.join(CUCKOO_ROOT, "storage", "analyses", str(id), category, task["sha256"])
if path_exists(file_path):
sample = [file_path]
break
if sample:
break
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The code iterates through 'files', 'procdump', and 'CAPE' categories to construct the file path. It would be beneficial to include 'suricata' as well to ensure all relevant files are checked. Also, consider using os.path.join for constructing the file path to ensure cross-platform compatibility.

Suggested change
# ToDo suricata path - "suricata.files.file_info.path
for category in ("files", "procdump", "CAPE"):
file_path = os.path.join(CUCKOO_ROOT, "storage", "analyses", str(id), category, task["sha256"])
if path_exists(file_path):
sample = [file_path]
break
if sample:
break
for category in ("files", "procdump", "CAPE", "suricata"):
file_path = os.path.join(CUCKOO_ROOT, "storage", "analyses", str(id), category, task["sha256"])
if path_exists(file_path):
sample = [file_path]
break

@doomedraven doomedraven merged commit 7e01a88 into master Apr 11, 2025
4 of 6 checks passed
@doomedraven doomedraven deleted the files branch April 11, 2025 07:35
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

1 participant