Process unstructured data sources #251

pombredanne · 2020-09-10T09:17:43Z

These contain valuable data nuggets among an ocean of junk and we need to be able to find the good things there.

Some sources are:

mailing lists such as:
changelogs Parse CHANGELOGs to discover new Vulnerabilities #233
reflogs of commit (see also the commits from vulncodedb and SAP/Eclipse steady KB)
bug and issue trackers (such as Django, etc)
actual description of a CVE or the text body of advisories. See Extract interesting data from CVE and other vulnerabilities body #551

We can either automate it all, but that's going to be super difficult, or rather start to craft a curation queue and parse as much as we can to make it easy to curate by humans

Create community curation model for VulnerableCode #218

... and progressively also improve some mini AI and classification to help further automate the work.

AyanSinhaMahapatra · 2023-02-09T17:48:37Z

A reference: https://hal.science/hal-03430826/document

ThePhilosopher4097 · 2023-03-27T12:41:18Z

Interested in the Project Idea...
I think, processing of changelogs, reflogs of commits and mailing list data can be a automated

TG1999 · 2024-01-23T16:37:37Z

Please also check: https://github.com/cve-search/git-vuln-finder

TG1999 · 2024-01-23T16:44:32Z

https://github.com/pyupio/changelogs

ykodwani01 · 2024-03-15T18:48:54Z

I guess the process of change logs of Apache mailing list can be automated using OpenAI' API or other open source LLMs, where we scrape the data using Selenium, feed into LLM, get the output as json format and then update the database accordingly. What is your view on that @pombredanne . #218 Can also be implemented.

Suraj209211 · 2024-03-15T19:55:40Z

Automating the extraction of valuable information from Apache mailing list changelogs using OpenAI’s API and other tools is a great initiative and I think for the unstructured data we can focus primarily onto the Dataset for feature Engineering and classified into diverse group

Model Training: Fine-tune the selected model on a prepared dataset of CVEs in code. This will help the model learn to identify vulnerabilities in the unstructured data..... As well as we can use LoRA for the model to train

Vulnerability Detection: Use the trained model to parse through the unstructured data and identify potential vulnerabilities. This could involve using NLP techniques to understand the vulnerability descriptions and infer the vulnerable package name and versions.

Most Important Parameter to be checked is this
Text Classification: This involves categorizing text into predefined groups. [In vulnerability detection, this could be used to classify descriptions as either indicating a vulnerability or not

Information Extraction: This is the process of automatically extracting structured information from unstructured text data.

@pombredanne @AyanSinhaMahapatra

pombredanne added enhancement Data collection labels Feb 24, 2021

pombredanne mentioned this issue Sep 15, 2021

Extract interesting data from CVE and other vulnerabilities body #551

Open

pombredanne removed the enhancement label Jan 24, 2022

This comment was marked as outdated.

Sign in to view

pombredanne mentioned this issue Jun 12, 2024

Extract unpublished vulnerabilities from commit histories and trackers #1129

Open

5 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Process unstructured data sources #251

Process unstructured data sources #251

pombredanne commented Sep 10, 2020 •

edited

Loading

AyanSinhaMahapatra commented Feb 9, 2023

This comment was marked as outdated.

ThePhilosopher4097 commented Mar 27, 2023

TG1999 commented Jan 23, 2024

TG1999 commented Jan 23, 2024

ykodwani01 commented Mar 15, 2024 •

edited

Loading

Suraj209211 commented Mar 15, 2024 •

edited

Loading

Process unstructured data sources #251

Process unstructured data sources #251

Comments

pombredanne commented Sep 10, 2020 • edited Loading

AyanSinhaMahapatra commented Feb 9, 2023

This comment was marked as outdated.

ThePhilosopher4097 commented Mar 27, 2023

TG1999 commented Jan 23, 2024

TG1999 commented Jan 23, 2024

ykodwani01 commented Mar 15, 2024 • edited Loading

Suraj209211 commented Mar 15, 2024 • edited Loading

pombredanne commented Sep 10, 2020 •

edited

Loading

ykodwani01 commented Mar 15, 2024 •

edited

Loading

Suraj209211 commented Mar 15, 2024 •

edited

Loading