Use and store hashes of all imported data #169
Conversation
Signed-off-by: Shivam Sandbhor <[email protected]>
@sbs2001 thanks! let me review this in detail and reply!
class AdvisoryHashes(models.Model):
    hash = models.BigIntegerField(primary_key=True, unique=True)
unique is redundant for primary keys.
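For illustration, the reviewed field with the redundant flag dropped (a sketch of this field only, not the full model):

```python
from django.db import models


class AdvisoryHashes(models.Model):
    # primary_key=True already implies a unique index, so unique=True is not needed.
    hash = models.BigIntegerField(primary_key=True)
```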
migrations.CreateModel(
    name='AdvisoryHashes',
    fields=[
        ('hash', models.IntegerField(primary_key=True, serialize=False, unique=True)),
This could be changed to a BigIntegerField to avoid the second migration.
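For illustration, a sketch of how the initial migration could declare the field as BigIntegerField from the start (dependencies and other boilerplate omitted; unique=True is also dropped since primary_key=True already enforces uniqueness):

```python
from django.db import migrations, models


class Migration(migrations.Migration):

    operations = [
        migrations.CreateModel(
            name='AdvisoryHashes',
            fields=[
                # A 64-bit xxhash value fits in a BigIntegerField, so no
                # follow-up migration is needed to widen the column.
                ('hash', models.BigIntegerField(primary_key=True, serialize=False)),
            ],
        ),
    ]
```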
Storing duplicate data should be avoided by defining appropriate constraints on database columns. @sbs2001 do you see any scenarios where this approach would not work?

Hashes can be useful in other contexts, which your code touches on as well. For example, detecting whether the data from a particular source has changed at all. In the case of Ruby advisories (and others), the data is stored in a git repository. A git repository already contains all the information about which files (if any) were added, which files (if any) were changed, and even what changed within those files. So we should make use of that information instead of retrieving a zip file of HEAD. In other cases the data might contain timestamps for the creation and update of each entry. I assume proper standards like OVAL and CVRF have this. For any kind of HTTP-based API we might be able to make use of things like ETags.

All these data source-specific approaches are supposed to be enabled by #152, with a timestamp of when a particular importer was run and a JSON column where hashes of previously seen data can be stored, and …
@haikoschol I haven't looked into your PR since it was updated. Since it is going to fix the data duplication problem, I will assume storing hashes won't matter for preventing duplicate data. As your comment suggests, we should use approaches that are more specific to the kind of advisories we are dealing with. I haven't researched these approaches and don't know how complex or easy they are to implement, but they will definitely be more efficient than mindlessly storing hashes, which in fact feels like a brute-force solution. I guess storing hashes is then of limited/no value.
I'm not sure it will fix them right away, but it should at least provide a base and make it possible to reproduce duplication issues we might run into with tests, so they are easy to fix.

My plan for today was to look at https://www.pygit2.org/ and see how it could be used in a subclass of DataSource to implement the API needed by ImportRunner, and then write tests for it that mock out all calls to pygit2. I didn't get to that today because getting rid of the redundant data class …

Like I said, it can still be useful for optimization, or necessary in cases where we don't have this information in the data source. So we should keep this in mind once we get to those.
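For illustration, a sketch of how pygit2 might be used to list the advisory files that changed since the last imported commit. The function and parameter names here are made up and this is not the actual DataSource/ImportRunner API, just the idea:

```python
import pygit2


def changed_files(repo_path, last_imported_commit, new_commit="HEAD"):
    """Return paths of advisory files added or modified between two commits."""
    repo = pygit2.Repository(repo_path)
    old = repo.revparse_single(last_imported_commit)
    new = repo.revparse_single(new_commit)
    diff = repo.diff(old, new)
    # Only added ('A') and modified ('M') files need to be (re)imported;
    # deletions and renames would need their own handling.
    return [
        delta.new_file.path
        for delta in diff.deltas
        if delta.status_char() in ("A", "M")
    ]
```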
I have been exploring all the current APIs we interact with; the following APIs have ETags in their responses: …
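For the sources that do return ETags, a conditional request is enough to skip unchanged data. A minimal sketch with the requests library; the function name and the way the stored ETag is persisted are assumptions:

```python
import requests


def fetch_if_changed(url, stored_etag=None):
    """Return (content, etag); content is None when the data is unchanged."""
    headers = {"If-None-Match": stored_etag} if stored_etag else {}
    response = requests.get(url, headers=headers)
    if response.status_code == 304:
        # The server reports the resource as unchanged since stored_etag.
        return None, stored_etag
    response.raise_for_status()
    return response.content, response.headers.get("ETag")
```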
@haikoschol what do you think would be the best way to deal with changed data in case a VulnerableCode instance is deployed somewhere? I don't think having a cron job of …

I was thinking about creating a hash table in the db and adding foreign keys to the hash in the Package and Vulnerability tables. The hash table would also contain ETags. During an import, if a particular hash (which was in our db) is not detected, we delete the hash from the db, which will cascade-delete all related Packages, Vulnerabilities and the linked references. We will also save all newly found hashes (and ETags) and don't push the …
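For illustration, the schema described above might look roughly like the sketch below. Model and field names are illustrative only, not the actual VulnerableCode schema, and the real Package/Vulnerability models have many more fields:

```python
from django.db import models


class ImportedDataHash(models.Model):
    # xxhash digest of the advisory data a row was derived from.
    hash = models.BigIntegerField(primary_key=True)
    etag = models.CharField(max_length=128, blank=True)


class Vulnerability(models.Model):
    # Deleting a stale ImportedDataHash cascades to the rows derived from it.
    source_hash = models.ForeignKey(
        ImportedDataHash, null=True, on_delete=models.CASCADE
    )


class Package(models.Model):
    source_hash = models.ForeignKey(
        ImportedDataHash, null=True, on_delete=models.CASCADE
    )
```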
I'm not sure what exactly the cases are that would cause problems. I guess one of them would be manual curation of an advisory that then gets updated by the publishing entity and hence overwritten in the next import run. I also don't fully understand the solution you're proposing. Therefore I am in favor of closing this PR and waiting until we have more information about the problem(s) before we start solving them. :)
This is just an experiment; I would love your opinions.
Why store hashes?
One of the fundamental problems we are currently facing with VulnerableCode is the inability to handle duplicate data; check #28 for more details. To address this, the following changes are made:
- Creation of the AdvisoryHashes model. It stores all sorts of hashes: hashes of advisory repositories, zips, individual files (JSON, YAML, etc.), and most importantly the basic component of VulnerableCode, i.e. individual Package-Vulnerability mappings (dictionaries, to be more specific).
- Added checks for these hashes in the npm and ruby importers (we will be extending these).
Advantages
The obvious data duplication problem is rectified.
Another interesting advantage is the avoidance of unnecessary work. For example, say we encounter during an import a YAML advisory file which was previously imported; because the hashes are identical, we can just ignore this file. Without storing hashes we would have made a bunch of API calls, parsed version ranges, and performed other computations.
Why xxhash
It is one of the fastest hashing algorithms available. It is a non-cryptographic algorithm, which is not a problem because we are not storing sensitive data (rather, we are making data more open).
How does it work?
Run our basic importers normally. Calculate the hash of each vulnerability component encountered (a vulnerability component can be anything: an advisory file, a zip, a repo, etc.). If the hash already exists in the db, skip the component; otherwise proceed with the import and add its hash to the db.
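A minimal sketch of that check, assuming the AdvisoryHashes model described above, the xxhash library, and that each component can be serialized deterministically (the helper name and import path are assumptions):

```python
import json

import xxhash

from vulnerabilities.models import AdvisoryHashes  # assumed module path


def already_imported(component):
    """Return True if the component's hash is already stored; record it otherwise."""
    # component is assumed to be a dict, e.g. a Package-Vulnerability mapping.
    data = json.dumps(component, sort_keys=True).encode()
    # intdigest() is an unsigned 64-bit value; shift it into the signed
    # range so it fits a BigIntegerField.
    digest = xxhash.xxh64(data).intdigest() - 2**63
    if AdvisoryHashes.objects.filter(hash=digest).exists():
        return True
    AdvisoryHashes.objects.create(hash=digest)
    return False
```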
To be figured out
How to maintain atomicity?
This boils down to when to actually write the hash entry to the db. This is workable; ideally we should be able to make the entries after the db dump.
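One option, sketched below, is to write the hash in the same database transaction as the imported data, so a hash is only recorded once its data is committed. save_component and compute_hash are hypothetical helpers, not functions from this PR:

```python
from django.db import transaction

from vulnerabilities.models import AdvisoryHashes  # assumed module path


def import_component(component):
    with transaction.atomic():
        # If saving the packages/vulnerabilities fails, the hash entry is
        # rolled back too, so the component is retried on the next run.
        save_component(component)  # hypothetical helper
        AdvisoryHashes.objects.create(hash=compute_hash(component))  # hypothetical helper
```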
Is the performance compromised?
There might be some minor performance overhead for a fresh import, but for subsequent imports, storing hashes should actually improve performance. Again, I have no data to back this up.