add suricata files to collection #2558
Conversation
Hello @doomedraven, I'm Gemini Code Assist[^1]! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!
Summary of Changes
This pull request focuses on enhancing file collection and search functionality within the Cuckoo sandbox environment. Specifically, it aims to incorporate Suricata files into the file collection process and optimize hash-based file lookups. The changes modify how file paths are retrieved and managed, particularly for files identified by their hashes, and refactor the search logic to improve efficiency and accuracy.
Highlights
- Suricata File Collection: The PR adds support for collecting files identified by Suricata, integrating them into the existing file collection mechanisms.
- Optimized Hash Search: The pull request refactors the hash search functionality to improve performance and accuracy when locating files by their hash values, leveraging the `files` collection in MongoDB.
- Code Cleanup: The PR removes redundant code and simplifies file path retrieval logic, enhancing code maintainability.
Changelog
- dev_utils/mongo_hooks.py
  - Added `suricata` files to the list of file dictionaries to be collected from reports.
  - Added
- lib/cuckoo/common/web_utils.py
  - Added a blank line for code formatting consistency.
  - Removed unused code related to file reference searching and replaced it with a more efficient method using `_task_ids`.
- lib/cuckoo/core/database.py
  - Removed unused `folders` dictionary.
  - Modified the file path retrieval logic to use `_task_ids` from the `files` collection for hash-based searches.
  - Removed deprecated code related to Elasticsearch and older MongoDB queries.
  - Removed redundant checks and streamlined the file path construction process.
  - Removed code related to searching for Suricata files based on SHA256 hashes.
  - Removed unused
Using Gemini Code Assist
The full guide for Gemini Code Assist can be found on our documentation page; here are some quick tips.
Invoking Gemini
You can request assistance from Gemini at any point in your pull request by creating an issue comment (i.e. a comment on the pull request page) using either `/gemini <command>` or `@gemini-code-assist <command>`. Below is a summary of the supported commands.
| Feature | Command | Description |
|---|---|---|
| Code Review | `/gemini review` | Performs a code review for the current pull request in its current state. |
| Pull Request Summary | `/gemini summary` | Provides a summary of the current pull request in its current state. |
| Comment | `@gemini-code-assist` | Responds in comments when explicitly tagged, both in issue comments and review comments. |
| Help | `/gemini help` | Displays a list of available commands. |
Customization
To customize the Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a `.gemini/` folder in the base of the repository. Detailed instructions can be found here.
Limitations & Feedback
Gemini Code Assist is currently in preview and may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counterproductive. You can react with 👍 and 👎 on @gemini-code-assist comments to provide feedback.
You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.
A hash, a path, a file's quest,
To find its home, put code to the test.
Suricata's watch,
For files to match,
In Cuckoo's nest, put code to the test.
Footnotes

[^1]: Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.
Code Review
The pull request introduces changes to collect Suricata files and modifies the search functionality. The changes seem reasonable, but require careful review to ensure correctness and efficiency, especially regarding the database queries and file path constructions.
Summary of Findings
- Suricata file extraction path: The code currently iterates through 'files', 'procdump', and 'CAPE' categories to construct file paths. It would be beneficial to include 'suricata' as well to ensure all relevant files are checked.
- Deprecated code: The code contains a section marked as 'deprecated code'. It's important to either remove this code entirely or provide a clear justification for keeping it, along with a plan for its eventual removal.
- MongoDB query change: The MongoDB query in `sample_path_by_hash` has been modified to search the `files` collection instead of `analysis`. Verify that this change aligns with the intended data retrieval strategy and that the new query is properly optimized.
Merge Readiness
The pull request introduces significant changes to file handling and database queries. While the changes seem to address the intended functionality, there are several areas of concern that need to be addressed before merging. Specifically, the MongoDB query change and the Suricata file extraction path need to be carefully reviewed. I am unable to directly approve the pull request, and users should have others review and approve this code before merging. Due to the medium and high severity comments, I recommend that the pull request not be merged until those are addressed (at a minimum).
```diff
 if term in hash_searches:
-    # For analyses where files have been stored in the "files" collection, search
-    # there for the _id (i.e. sha256) of documents matching the given hash. As a
-    # special case, we don't need to do that query if the requested hash type is
-    # "sha256" since that's what's stored in the "file_refs" key.
-    # We do all this in addition to search the old keys for backwards-compatibility
-    # with documents that do not use this mechanism for storing file data.
-    if term == "sha256":
-        file_refs = [query_val]
-    else:
-        file_docs = mongo_find(FILES_COLL, {hash_searches[term]: query_val}, {"_id": 1})
-        file_refs = [doc["_id"] for doc in file_docs]
-    if file_refs:
-        if len(file_refs) > 1:
-            query = {"$in": file_refs}
-        else:
-            query = file_refs[0]
-        search_terms.extend([{f"{pfx}.{FILE_REF_KEY}": query} for pfx in NORMALIZED_FILE_FIELDS])
-        mongo_search_query = {"$or": search_terms}
+    # The file details are uniq, and we store 1 to many. So where hash type is uniq, IDs are list
+    file_docs = list(mongo_find(FILES_COLL, {hash_searches[term]: query_val}, {"_task_ids": 1}))
+    if not file_docs:
+        return []
+    ids = sorted(list(set(file_docs[0]["_task_ids"])), reverse=True)[:search_limit]
+    term = "ids"
+    mongo_search_query = {"info.id": {"$in": ids}}
```
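The new lookup path can be exercised in isolation. Below is a minimal, runnable sketch; the stubbed `mongo_find`, the canned document, and the hash value are assumptions for illustration, not CAPE's real helper or data:

```python
# Sketch of the new _task_ids-based hash search (stubbed, hypothetical data).
FILES_COLL = "files"

def mongo_find(collection, query, projection):
    # Stand-in for CAPE's mongo_find helper: returns canned "files" documents
    # carrying the IDs of the tasks in which this hash was seen.
    return [{"_task_ids": [42, 17, 42]}]

search_limit = 50
file_docs = list(mongo_find(FILES_COLL, {"md5": "0" * 32}, {"_task_ids": 1}))
if not file_docs:
    mongo_search_query = None
else:
    # Deduplicate, newest task first, capped at the search limit.
    ids = sorted(set(file_docs[0]["_task_ids"]), reverse=True)[:search_limit]
    mongo_search_query = {"info.id": {"$in": ids}}
print(mongo_search_query)  # {'info.id': {'$in': [42, 17]}}
```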
The comment indicates that file details are unique and stored one-to-many, suggesting that `_task_ids` should contain a list of task IDs. However, the code retrieves only the first element of `file_docs` (`file_docs[0]`). If the `mongo_find` query returns multiple `file_docs`, this logic will only consider the `_task_ids` from the first document. Consider iterating through all `file_docs` to collect all unique `_task_ids`:

```python
ids = []
for file_doc in file_docs:
    ids.extend(file_doc.get("_task_ids", []))
ids = sorted(list(set(ids)), reverse=True)[:search_limit]
```
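To see why the aggregation matters, here is a runnable toy comparison of the two approaches; the document shapes and task IDs are made up for illustration:

```python
def first_doc_ids(file_docs, search_limit):
    # Mirrors the PR as written: only the first matching document is considered.
    return sorted(set(file_docs[0]["_task_ids"]), reverse=True)[:search_limit]

def all_doc_ids(file_docs, search_limit):
    # Reviewer's suggestion: merge _task_ids across every matching document.
    ids = []
    for file_doc in file_docs:
        ids.extend(file_doc.get("_task_ids", []))
    return sorted(set(ids), reverse=True)[:search_limit]

docs = [{"_task_ids": [5, 9]}, {"_task_ids": [9, 2]}]
print(first_doc_ids(docs, 10))  # [9, 5] -- task 2 is silently dropped
print(all_doc_ids(docs, 10))    # [9, 5, 2]
```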
```diff
 if repconf.mongodb.enabled and web_conf.general.check_sample_in_mongodb:
     tasks = mongo_find(
-        "analysis",
-        {f"CAPE.payloads.{sizes_mongo.get(len(sample_hash), '')}": sample_hash},
-        {"CAPE.payloads": 1, "_id": 0, "info.id": 1},
+        "files",
+        {sizes_mongo.get(len(sample_hash), ""): sample_hash},
+        {"_info_ids": 1, "sha256": 1},
     )
```
The MongoDB query has been modified to search the `files` collection instead of `analysis`. Verify that this change aligns with the intended data retrieval strategy and that the new query is properly optimized for the `files` collection schema. Also, consider adding error handling in case the `mongo_find` function fails or returns unexpected results.
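One way to act on the error-handling note is a small guard around the lookup. This is a hedged sketch: `safe_find` is a hypothetical helper, not part of CAPE, and the lambdas stand in for the real `mongo_find`:

```python
def safe_find(find_fn, *args, default=None):
    # Hypothetical guard: normalize failures and odd results to a plain list.
    try:
        result = find_fn(*args)
        return list(result) if result is not None else (default if default is not None else [])
    except Exception:
        # In CAPE this would be reported via log.error(...) rather than swallowed.
        return default if default is not None else []

def boom(*_):
    raise RuntimeError("mongo is down")

print(safe_find(lambda coll, q: [{"sha256": "ab" * 32}], "files", {}))  # one doc
print(safe_find(boom, "files", {}))                                     # []
```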
```python
# ToDo suricata path - "suricata.files.file_info.path
for category in ("files", "procdump", "CAPE"):
    file_path = os.path.join(CUCKOO_ROOT, "storage", "analyses", str(id), category, task["sha256"])
    if path_exists(file_path):
        sample = [file_path]
        break
if sample:
    break
```
The code iterates through the 'files', 'procdump', and 'CAPE' categories to construct the file path. It would be beneficial to include 'suricata' as well to ensure all relevant files are checked.
```diff
-# ToDo suricata path - "suricata.files.file_info.path
-for category in ("files", "procdump", "CAPE"):
+for category in ("files", "procdump", "CAPE", "suricata"):
     file_path = os.path.join(CUCKOO_ROOT, "storage", "analyses", str(id), category, task["sha256"])
     if path_exists(file_path):
         sample = [file_path]
         break
-if sample:
-    break
```
No description provided.