
Use and store hashes of all imported data #169


Closed
wants to merge 2 commits into from

Conversation

@sbs2001 (Collaborator) commented Apr 2, 2020

This is just an experiment; I would love your opinions.

Why store hashes?

One of the fundamental problems we are currently facing with VulnerableCode is the inability to handle duplicate data; check #28 for more details. To address this, the following changes are made:

Creation of the AdvisoryHashes model. It stores all sorts of hashes: hashes of advisory repositories, zips, individual files (JSON, YAML, etc.), and, most importantly, of the basic component of VulnerableCode, i.e. individual Package-Vulnerability mappings (dictionaries, to be more specific).

Added checks for these hashes in the npm and ruby importers (we will be extending these).

Advantages

The obvious data duplication problem is rectified.
Another interesting advantage is the avoidance of unnecessary work. Consider an example: say we encounter, during an import, a YAML advisory file which was previously imported. Because the hashes are identical, we can just ignore this file. Without stored hashes we would have made a bunch of API calls, parsed version ranges, and performed other computations.

Why xxhash?

It is one of the fastest hashing algorithms. It is non-cryptographic, which is not a problem here because we are not storing sensitive data (rather, we are making data more open).
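
For illustration, a minimal sketch of how xxhash could be used to fingerprint an advisory file; the file path and chunk size below are made up, not code from this PR.

```python
# Minimal sketch (not part of this PR) of hashing an advisory file with the
# xxhash package; the file path below is hypothetical.
import xxhash

def file_hash(path):
    hasher = xxhash.xxh64()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(8192), b""):
            hasher.update(chunk)
    # intdigest() returns an unsigned 64-bit integer; storing it in a signed
    # BigIntegerField may require shifting it into the signed range first.
    return hasher.intdigest()

print(file_hash("advisories/example.yaml"))
```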

How does it work?

Run our basic importers normally and calculate the hash of each vulnerability component encountered (a vulnerability component can be anything: an advisory file, a zip, a repo, etc.). If the hash already exists in the db, skip this component; otherwise proceed with the import and add its hash to the db.
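
A rough sketch of that loop, assuming the AdvisoryHashes model from this PR; `process_component` and the import path are placeholders.

```python
# Hypothetical importer loop illustrating the skip-if-seen idea.
import xxhash

from vulnerabilities.models import AdvisoryHashes  # import path assumed

def import_components(components, process_component):
    for component_bytes in components:
        digest = xxhash.xxh64(component_bytes).intdigest()
        if AdvisoryHashes.objects.filter(hash=digest).exists():
            # Seen before: skip the API calls, version parsing, etc.
            continue
        process_component(component_bytes)
        AdvisoryHashes.objects.create(hash=digest)
```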

To be figured out

How to maintain atomicity?
This boils down to when exactly to write the hash entry to the db. This is workable; ideally we should be able to make the entries after the db dump.
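
One possible shape for this, purely as a sketch (not what the PR currently does): record the hashes only after the rest of the import has succeeded, inside one transaction.

```python
# Hedged sketch: persist hashes last, inside a single transaction, so a
# failed import run does not leave components marked as already seen.
from django.db import transaction

from vulnerabilities.models import AdvisoryHashes  # import path assumed

def import_batch(hashed_components, process_component):
    with transaction.atomic():
        new_hashes = []
        for digest, component in hashed_components:
            process_component(component)  # creates Packages, Vulnerabilities, ...
            new_hashes.append(AdvisoryHashes(hash=digest))
        # If anything above raised, this line is never reached and the
        # transaction rolls back, keeping hashes and imported data consistent.
        AdvisoryHashes.objects.bulk_create(new_hashes)
```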

Is the performance compromised?
There might be a minor performance hit for a fresh import, but for subsequent imports storing hashes will actually improve performance. Again, I have no data to back this up.

Signed-off-by: Shivam Sandbhor <[email protected]>
@pombredanne (Member)

@sbs2001 thanks! Let me review this in detail and reply!



class AdvisoryHashes(models.Model):
    hash = models.BigIntegerField(primary_key=True, unique=True)
Collaborator

unique is redundant for primary keys.

migrations.CreateModel(
    name='AdvisoryHashes',
    fields=[
        ('hash', models.IntegerField(primary_key=True, serialize=False, unique=True)),
Collaborator

This could be changed to a BigIntegerField to avoid the second migration.

@haikoschol (Collaborator) commented Apr 3, 2020

Storing duplicate data should be avoided by defining appropriate constraints on database columns. @sbs2001 do you see any scenarios where this approach would not work?
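
For example, a uniqueness constraint along these lines (model and field names here are only illustrative, not the actual VulnerableCode schema):

```python
# Illustrative only: a column-level constraint that makes duplicate
# Package-Vulnerability rows impossible at the database level.
from django.db import models

class PackageVulnerabilityRelation(models.Model):
    package = models.ForeignKey("Package", on_delete=models.CASCADE)
    vulnerability = models.ForeignKey("Vulnerability", on_delete=models.CASCADE)

    class Meta:
        unique_together = ("package", "vulnerability")
```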

Hashes can be useful in other contexts, which your code touches on as well. For example, detecting whether the data from a particular source has changed at all.

In the case of Ruby advisories (and others), the data is stored in a git repository. A git repository already contains all the information about what files (if any) were added, what files (if any) were changed, and even what changed within those files. So we should make use of that information instead of retrieving a zip file of HEAD.

In other cases the data might contain timestamps for creation and update of each entry. I assume proper standards like OVAL and CVRF have this.

For any kind of HTTP-based API we might be able to make use of things like ETags.

All of these data source-specific approaches are supposed to be enabled by #152, with a timestamp of when a particular importer was last run, a JSON column where hashes of previously seen data can be stored, and DataSource subclasses that know, for example, how to interrogate git repositories.
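
Roughly along these lines, as a sketch of the idea only (the actual fields and names live in #152):

```python
# Illustrative sketch of per-importer bookkeeping; names are hypothetical.
from django.db import models

class Importer(models.Model):
    name = models.CharField(max_length=100)
    last_run = models.DateTimeField(null=True)
    # Hashes (or ETags, timestamps, ...) of data seen in previous runs.
    saved_data = models.JSONField(default=dict)  # JSONField needs Django 3.1+
```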

@sbs2001 (Collaborator, Author) commented Apr 4, 2020

@haikoschol I haven't looked into your PR since it was updated. As it is going to fix the data duplication problem, I will assume storing hashes won't matter in the context of preventing duplicate data.

Your comment suggests using approaches which are more specific to the kind of advisories we are dealing with. I haven't researched these approaches and don't know how complex or easy they are to implement, but they will definitely be more efficient than mindlessly storing hashes, which in fact feels like a brute-force solution.

I guess storing hashes is then of limited/no value.

@haikoschol (Collaborator)

> @haikoschol I haven't looked into your PR since it was updated. As it is going to fix the data duplication problem, I will assume storing hashes won't matter in the context of preventing duplicate data.

I'm not sure it will fix them right away, but it should at least provide a base and make it possible to reproduce, with tests, any duplication issues we might run into so they are easy to fix.

> Your comment suggests using approaches which are more specific to the kind of advisories we are dealing with. I haven't researched these approaches and don't know how complex or easy they are to implement

My plan for today was to look at https://www.pygit2.org/ and see how that could be used in a subclass of DataSource to implement the API needed by ImportRunner. And then write tests for it that mock out all calls to pygit.

I didn't get to that today because getting rid of the redundant data class Package in favor of using PackageURL directly took longer than expected. If you are interested in looking into this, feel free. :) I would start by playing with pygit2 in an IPython shell, using the Rust advisory repository. Basically, figure out how to get a list of files added since timestamp $x, etc.
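
Something like this could be a starting point (a sketch only; the repository path and cutoff timestamp are made up):

```python
# Hedged sketch: list files added to an advisory git repo since a timestamp.
import pygit2

repo = pygit2.Repository("/tmp/advisory-db")  # hypothetical local clone
cutoff = 1585699200  # unix timestamp of the previous import run (made up)

added_files = set()
for commit in repo.walk(repo.head.target, pygit2.GIT_SORT_TIME):
    if commit.commit_time < cutoff:
        break  # walking newest-first, so everything older can be skipped
    for parent in commit.parents:
        diff = repo.diff(parent, commit)
        for delta in diff.deltas:
            if delta.status == pygit2.GIT_DELTA_ADDED:
                added_files.add(delta.new_file.path)

print(added_files)
```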

> I guess storing hashes is then of limited/no value.

Like I said, it can still be useful for optimization or necessary in cases where we don't have this information in the data source. So we should keep this in mind once we get to those.

@sbs2001 (Collaborator, Author) commented May 11, 2020

I have been exploring all the current APIs we interact with; the following APIs return ETags in their responses (see the sketch after the list):

1. All GitHub-related stuff. This includes downloading the zip of a repo.
2. Same with GitLab.
3. ClearlyDefined.io also has ETags, but we don't utilise their API atm.
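
For context, this is roughly how a conditional request with an ETag could look (the URL and stored ETag value are made up):

```python
# Hedged sketch of a conditional GET using a previously stored ETag.
import requests

url = "https://api.github.com/repos/example/advisory-db/zipball"
stored_etag = 'W/"etag-saved-from-the-last-run"'

response = requests.get(url, headers={"If-None-Match": stored_etag})
if response.status_code == 304:
    # Nothing changed since the last import; this source can be skipped.
    pass
else:
    new_etag = response.headers.get("ETag")
    # ...download and import here, then persist new_etag for the next run.
```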

@sbs2001 (Collaborator, Author) commented May 30, 2020

@haikoschol what do you think would be the best way to deal with changed data in case a VulnerableCode instance is deployed somewhere? I don't think having a cron job to import everything will work; even if we are able to deal with duplicates, what about stale and outdated (hence incorrect) data?

I was thinking about creating a hash table in the db and adding foreign keys to the hash from the Package and Vulnerability tables. The hash table would also contain ETags. During an import, if a particular hash that was in our db is no longer detected, we delete the hash from the db, which cascade-deletes all related Packages, Vulnerabilities, and the linked references. We would also save all newly found hashes (and ETags) and not push Advisory objects whose hashes are already present into the bulk_create batch.
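
Something like this, purely as a sketch of the proposal (names are hypothetical and the real Package/Vulnerability models have more fields):

```python
# Illustrative models for the proposal above.
from django.db import models

class AdvisoryHash(models.Model):
    hash = models.BigIntegerField(primary_key=True)
    etag = models.CharField(max_length=256, blank=True)

class Package(models.Model):
    # ... existing fields ...
    source_hash = models.ForeignKey(AdvisoryHash, null=True, on_delete=models.CASCADE)

class Vulnerability(models.Model):
    # ... existing fields ...
    source_hash = models.ForeignKey(AdvisoryHash, null=True, on_delete=models.CASCADE)

# During an import, AdvisoryHash rows that are no longer seen get deleted;
# on_delete=CASCADE then removes the Packages and Vulnerabilities tied to them.
```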

@sbs2001 closed this on Jun 6, 2020
@haikoschol (Collaborator)

> @haikoschol what do you think would be the best way to deal with changed data in case a VulnerableCode instance is deployed somewhere? I don't think having a cron job to import everything will work; even if we are able to deal with duplicates, what about stale and outdated (hence incorrect) data?

> I was thinking about creating a hash table in the db and adding foreign keys to the hash from the Package and Vulnerability tables. The hash table would also contain ETags. During an import, if a particular hash that was in our db is no longer detected, we delete the hash from the db, which cascade-deletes all related Packages, Vulnerabilities, and the linked references. We would also save all newly found hashes (and ETags) and not push Advisory objects whose hashes are already present into the bulk_create batch.

I'm not sure what exactly the cases are that would cause problems. I guess one of them would be manual curation of an advisory that then gets updated by the publishing entity and hence overwritten in the next import run. I also don't fully understand the solution you're proposing. Therefore I am in favor of closing this PR and waiting until we have more information about the problem(s) before we start solving them. :)
