
Use and store hashes of all imported data #169


Closed
wants to merge 2 commits into from

Conversation

@sbs2001 (Collaborator) commented Apr 2, 2020

This is just an experiment; I would love your opinions.

Why store hashes?

One of the fundamental problems we are currently facing with VulnerableCode is the inability to handle duplicate data; check #28 for more details. To address this, the following changes are made:

Creation of the AdvisoryHashes model. It stores all sorts of hashes: hashes of advisory repositories, zips, individual files (JSON, YAML, etc.), and, most importantly, of the basic component of VulnerableCode, i.e. individual Package-Vulnerability mappings (dictionaries, to be more specific).

Added checks for these hashes in the npm and ruby importers (we will be extending these).

Advantages

The obvious data duplication problem is rectified.
Another interesting advantage is the avoidance of unnecessary work. Consider an example: say we encounter, during an import, a YAML advisory file which was previously imported. Because the hashes are identical, we can just ignore this file. Without stored hashes we would have made a bunch of API calls, parsed version ranges, and performed other computations.

Why xxhash?

It is one of the fastest hashing algorithms. It is non-cryptographic, which is not a problem here because we are not storing sensitive data (rather, we are making data more open).
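
For illustration, a minimal sketch of how xxhash could be used to fingerprint an advisory file; the file path and chunk size below are made up, not code from this PR.

```python
# Minimal sketch (not part of this PR) of hashing an advisory file with the
# xxhash package; the file path below is hypothetical.
import xxhash

def file_hash(path):
    hasher = xxhash.xxh64()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(8192), b""):
            hasher.update(chunk)
    # intdigest() returns an unsigned 64-bit integer; storing it in a signed
    # BigIntegerField may require shifting it into the signed range first.
    return hasher.intdigest()

print(file_hash("advisories/example.yaml"))
```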

How does it work?

Run our basic importers normally and calculate the hash of each vulnerability component encountered (a vulnerability component can be anything: an advisory file, a zip, a repo, etc.). If the hash already exists in the db, skip this component; otherwise proceed with the import and add its hash to the db.
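
A rough sketch of that loop, assuming the AdvisoryHashes model from this PR; `process_component` and the import path are placeholders.

```python
# Hypothetical importer loop illustrating the skip-if-seen idea.
import xxhash

from vulnerabilities.models import AdvisoryHashes  # import path assumed

def import_components(components, process_component):
    for component_bytes in components:
        digest = xxhash.xxh64(component_bytes).intdigest()
        if AdvisoryHashes.objects.filter(hash=digest).exists():
            # Seen before: skip the API calls, version parsing, etc.
            continue
        process_component(component_bytes)
        AdvisoryHashes.objects.create(hash=digest)
```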

To be figured out

How to maintain atomicity?
This boils down to when exactly to write the hash entry to the db. This is workable; ideally we should be able to make the entries after the db dump.
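
One possible shape for this, purely as a sketch (not what the PR currently does): record the hashes only after the rest of the import has succeeded, inside one transaction.

```python
# Hedged sketch: persist hashes last, inside a single transaction, so a
# failed import run does not leave components marked as already seen.
from django.db import transaction

from vulnerabilities.models import AdvisoryHashes  # import path assumed

def import_batch(hashed_components, process_component):
    with transaction.atomic():
        new_hashes = []
        for digest, component in hashed_components:
            process_component(component)  # creates Packages, Vulnerabilities, ...
            new_hashes.append(AdvisoryHashes(hash=digest))
        # If anything above raised, this line is never reached and the
        # transaction rolls back, keeping hashes and imported data consistent.
        AdvisoryHashes.objects.bulk_create(new_hashes)
```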

Is the performance compromised?
There might be a minor performance hit for a fresh import, but for subsequent imports storing hashes will actually improve performance. Again, I have no data to back this up.

Signed-off-by: Shivam Sandbhor <[email protected]>
@pombredanne (Member)

@sbs2001 thanks! Let me review this in detail and reply!



class AdvisoryHashes(models.Model):
    hash = models.BigIntegerField(primary_key=True, unique=True)
Collaborator

unique is redundant for primary keys.

migrations.CreateModel(
    name='AdvisoryHashes',
    fields=[
        ('hash', models.IntegerField(primary_key=True, serialize=False, unique=True)),
Collaborator

This could be changed to a BigIntegerField to avoid the second migration.

@haikoschol (Collaborator) commented Apr 3, 2020

Storing duplicate data should be avoided by defining appropriate constraints on database columns. @sbs2001 do you see any scenarios where this approach would not work?
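
For example, a uniqueness constraint along these lines (model and field names here are only illustrative, not the actual VulnerableCode schema):

```python
# Illustrative only: a column-level constraint that makes duplicate
# Package-Vulnerability rows impossible at the database level.
from django.db import models

class PackageVulnerabilityRelation(models.Model):
    package = models.ForeignKey("Package", on_delete=models.CASCADE)
    vulnerability = models.ForeignKey("Vulnerability", on_delete=models.CASCADE)

    class Meta:
        unique_together = ("package", "vulnerability")
```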

Hashes can be useful in other contexts, which your code touches on as well. For example, detecting whether the data from a particular source has changed at all.

In the case of Ruby advisories (and others), the data is stored in a git repository. A git repository already contains all the information about what files (if any) were added, what files (if any) were changed, and even what changed within those files. So we should make use of that information instead of retrieving a zip file of HEAD.

In other cases the data might contain timestamps for creation and update of each entry. I assume proper standards like OVAL and CVRF have this.

For any kind of HTTP-based API we might be able to make use of things like ETags.

All of these data source-specific approaches are supposed to be enabled by #152, with a timestamp of when a particular importer was last run, a JSON column where hashes of previously seen data can be stored, and DataSource subclasses that know, for example, how to interrogate git repositories.
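
Roughly along these lines, as a sketch of the idea only (the actual fields and names live in #152):

```python
# Illustrative sketch of per-importer bookkeeping; names are hypothetical.
from django.db import models

class Importer(models.Model):
    name = models.CharField(max_length=100)
    last_run = models.DateTimeField(null=True)
    # Hashes (or ETags, timestamps, ...) of data seen in previous runs.
    saved_data = models.JSONField(default=dict)  # JSONField needs Django 3.1+
```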

@sbs2001 (Collaborator, Author) commented Apr 4, 2020

@haikoschol I haven't looked into your PR since it was updated. As it is going to fix the data duplication problem, I will assume storing hashes won't matter in the context of preventing duplicate data.

Your comment suggests using approaches which are more specific to the kind of advisories we are dealing with. I haven't researched these approaches and don't know how complex or easy they are to implement, but they will definitely be more efficient than mindlessly storing hashes, which in fact feels like a brute-force solution.

I guess storing hashes is then of limited/no value.

@haikoschol (Collaborator)

> @haikoschol I haven't looked into your PR since it was updated. As it is going to fix the data duplication problem, I will assume storing hashes won't matter in the context of preventing duplicate data.

I'm not sure it will fix them right away, but it should at least provide a base and make it possible to reproduce, with tests, any duplication issues we might run into so they are easy to fix.

> Your comment suggests using approaches which are more specific to the kind of advisories we are dealing with. I haven't researched these approaches and don't know how complex or easy they are to implement

My plan for today was to look at https://www.pygit2.org/ and see how that could be used in a subclass of DataSource to implement the API needed by ImportRunner. And then write tests for it that mock out all calls to pygit.

I didn't get to that today because getting rid of the redundant data class Package in favor of using PackageURL directly took longer than expected. If you are interested in looking into this, feel free. :) I would start by playing with pygit2 in an IPython shell, using the Rust advisory repository. Basically, figure out how to get a list of files added since timestamp $x, etc.
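
Something like this could be a starting point (a sketch only; the repository path and cutoff timestamp are made up):

```python
# Hedged sketch: list files added to an advisory git repo since a timestamp.
import pygit2

repo = pygit2.Repository("/tmp/advisory-db")  # hypothetical local clone
cutoff = 1585699200  # unix timestamp of the previous import run (made up)

added_files = set()
for commit in repo.walk(repo.head.target, pygit2.GIT_SORT_TIME):
    if commit.commit_time < cutoff:
        break  # walking newest-first, so everything older can be skipped
    for parent in commit.parents:
        diff = repo.diff(parent, commit)
        for delta in diff.deltas:
            if delta.status == pygit2.GIT_DELTA_ADDED:
                added_files.add(delta.new_file.path)

print(added_files)
```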

> I guess storing hashes is then of limited/no value.

Like I said, it can still be useful for optimization or necessary in cases where we don't have this information in the data source. So we should keep this in mind once we get to those.

@sbs2001 (Collaborator, Author) commented May 11, 2020

I have been exploring all the current APIs we interact with; the following APIs return ETags in their responses (see the sketch after the list):

1. All GitHub-related stuff. This includes downloading the zip of a repo.
2. Same with GitLab.
3. ClearlyDefined.io also has ETags, but we don't utilise their API atm.
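
For context, this is roughly how a conditional request with an ETag could look (the URL and stored ETag value are made up):

```python
# Hedged sketch of a conditional GET using a previously stored ETag.
import requests

url = "https://api.github.com/repos/example/advisory-db/zipball"
stored_etag = 'W/"etag-saved-from-the-last-run"'

response = requests.get(url, headers={"If-None-Match": stored_etag})
if response.status_code == 304:
    # Nothing changed since the last import; this source can be skipped.
    pass
else:
    new_etag = response.headers.get("ETag")
    # ...download and import here, then persist new_etag for the next run.
```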

@sbs2001 (Collaborator, Author) commented May 30, 2020

@haikoschol what do you think would be the best way to deal with changed data in case a VulnerableCode instance is deployed somewhere? I don't think having a cron job to import everything will work; even if we are able to deal with duplicates, what about stale and outdated (hence incorrect) data?

I was thinking about creating a hash table in the db and adding foreign keys to the hash from the Package and Vulnerability tables. The hash table would also contain ETags. During an import, if a particular hash that was in our db is no longer detected, we delete the hash from the db, which cascade-deletes all related Packages, Vulnerabilities, and the linked references. We would also save all newly found hashes (and ETags) and not push Advisory objects whose hashes are already present into the bulk_create batch.
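
Something like this, purely as a sketch of the proposal (names are hypothetical and the real Package/Vulnerability models have more fields):

```python
# Illustrative models for the proposal above.
from django.db import models

class AdvisoryHash(models.Model):
    hash = models.BigIntegerField(primary_key=True)
    etag = models.CharField(max_length=256, blank=True)

class Package(models.Model):
    # ... existing fields ...
    source_hash = models.ForeignKey(AdvisoryHash, null=True, on_delete=models.CASCADE)

class Vulnerability(models.Model):
    # ... existing fields ...
    source_hash = models.ForeignKey(AdvisoryHash, null=True, on_delete=models.CASCADE)

# During an import, AdvisoryHash rows that are no longer seen get deleted;
# on_delete=CASCADE then removes the Packages and Vulnerabilities tied to them.
```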

@sbs2001 closed this on Jun 6, 2020
@haikoschol (Collaborator)

> @haikoschol what do you think would be the best way to deal with changed data in case a VulnerableCode instance is deployed somewhere? I don't think having a cron job to import everything will work; even if we are able to deal with duplicates, what about stale and outdated (hence incorrect) data?

> I was thinking about creating a hash table in the db and adding foreign keys to the hash from the Package and Vulnerability tables. The hash table would also contain ETags. During an import, if a particular hash that was in our db is no longer detected, we delete the hash from the db, which cascade-deletes all related Packages, Vulnerabilities, and the linked references. We would also save all newly found hashes (and ETags) and not push Advisory objects whose hashes are already present into the bulk_create batch.

I'm not sure what exactly the cases are that would cause problems. I guess one of them would be manual curation of an advisory that then gets updated by the publishing entity and hence overwritten in the next import run. I also don't fully understand the solution you're proposing. Therefore I am in favor of closing this PR and waiting until we have more information about the problem(s) before we start solving them. :)
