Add versioned sincedb upgrade; use fingerprints instead of inode and/or path #79
Conversation
If we introduce a fingerprint instead of the path as the sincedb key, we rely on two inherent properties of a file: its address (inode) and a hash of some of its content. Neither is user adjustable. So in comparing two files we have this matrix:
@colinsurprenant says "if we have a sufficiently robust hashing strategy that reduces collisions significantly we can assert that two files with the same fingerprint on different inodes are the same file." Therefore the next discussion is what the robust hashing strategy should be.
This is a hard problem. I wish I had better answers at this time :\ That said, I wonder if we can get away with just using a content fingerprint, and ignore addresses. If we assume that log files include timestamps -- meaning that, even for the same "log message", the timestamp content will differ between two different log files -- maybe we don't even need addresses? Given @colinsurprenant's hypothetical of a 'sufficiently robust hashing strategy', I have a feeling that we could use a content hash and ignore addresses (file path, inode, device). Thoughts? It would eliminate the column portion of the matrix and just have rows (content fingerprint).
👍 on fingerprints. In Slack, @jsvd, @jordansissel and I concluded the following:
The upgrader will try to match inodes from an old sincedb with discovered files. We need to leave unmatched sincedb records in the new sincedb file as they may be re-discovered later.

- Example 1: imagine we have previously read a file with 42 bytes; on disk we have one sincedb entry of
- Example 2: we discover a small file of less than 255 bytes, e.g.
- When we discover a file bigger than 32K we can pre-compute two checksums, one at 0,255 and the other at 32768,255 (or less if the file size is less than 32K + 255).
- When we discover a file smaller than 32K we pre-compute only one checksum at 0,255 (or less if the file size is less than 255).
- Example 3: we discover a very large file with a checksum of
- Example 4: we discover a file that is 32768 + 10 bytes. Its checksums are at 0,255 and 32768,10.

At file discovery we will cache either the bytes at 0,255 or 32768,255 to make the re-compute of checksums quicker.
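The two-checksum discovery rules above can be sketched as follows. This is a minimal illustration, not the PR's actual implementation: the constant names mirror the `FP_BYTE_SIZE`/`FILE_READ_SIZE` values discussed in the review, and CRC32 stands in for whatever hashing strategy the "sufficiently robust" discussion settles on.

```ruby
require "zlib"

FP_BYTE_SIZE   = 255    # bytes hashed per fingerprint
FILE_READ_SIZE = 32768  # offset of the second fingerprint

# Compute one or two fingerprints for a file, per the discovery rules:
# always checksum bytes 0,255 (or fewer if the file is shorter), and for
# files bigger than 32K also checksum up to 255 bytes starting at 32768.
def fingerprints(path)
  size = File.size(path)
  fps = []
  File.open(path, "rb") do |io|
    fps << Zlib.crc32(io.read(FP_BYTE_SIZE))
    if size > FILE_READ_SIZE
      io.seek(FILE_READ_SIZE)
      fps << Zlib.crc32(io.read(FP_BYTE_SIZE))
    end
  end
  fps
end
```

With this scheme, Example 4 (a file of 32768 + 10 bytes) yields two fingerprints, the second covering only the final 10 bytes.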
@guyboertje @suyograo do we have a new estimate when this will be merged? 😃 Thanks!!
…remove reverse from the sort.
lib/filewatch/boot_setup.rb
Outdated
FP_BYTE_SIZE = 255
FILE_READ_SIZE = 32768
SDB_EXPIRES_DAYS = 10
FIXNUM_MAX = (2**(0.size * 8 - 2) - 1)
Shame that there is no constant in the Ruby world for that.
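For context on that reviewed line: the expression derives the largest Fixnum from the machine word size, since pre-2.4 Ruby had no built-in constant for it (Ruby 2.4+ unified Fixnum and Bignum into Integer, so overflow is invisible anyway). A short sketch of what it computes:

```ruby
# 0.size is the byte width of a machine word (8 on a 64-bit build).
# Pre-2.4 Ruby Fixnums reserve one bit for a type tag and one for the
# sign, leaving word_bits - 2 usable magnitude bits.
FIXNUM_MAX = (2**(0.size * 8 - 2) - 1)
# On a 64-bit platform this is 2**62 - 1, i.e. 4611686018427387903.
```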
lib/filewatch/watched_file.rb
Outdated
@@ -1,31 +1,59 @@
require "filewatch/buftok"
require 'filewatch/boot_setup' unless defined?(FileWatch)
# encoding: utf-8
Really awesome work @guyboertje ! I like reading the code changes. Before moving forward I suggest we do the following.
In this PR I see a lot of changes to the actual test suite, and they are mostly integration tests. I think we should try to add unit tests for all the newly added classes and the ones that have changed a lot.
Any chance of a second round of reviews?
Any progress on this one? One year has passed and we still lose data because of inodes staying in sincedb forever...
@LionelCons this library could ignore files if the inode changes but the file size does not. Log files typically grow to a certain (large) size and are then never written to again. If an inode is reused, a new log file is created (it starts with zero size and grows), so this library should see that the file size shrank compared to the old inode information and should start over reading that file. Can you open a new issue that describes your scenario (how are you writing to these files, how often are inodes reused, what does
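The shrink-detection heuristic described in that comment can be sketched as below. This is an illustrative reduction, not the library's actual code; `SincedbEntry` and `action_for` are hypothetical names.

```ruby
# A sincedb record: which inode we were reading, and how far we got.
SincedbEntry = Struct.new(:inode, :bytes_read)

# Decide what to do when a file with a known (or unknown) inode is seen.
# A current size smaller than bytes_read implies the inode was reused
# for a new file, so reading restarts from the beginning.
def action_for(entry, current_size)
  if entry.nil?
    :read_from_start     # inode never seen before
  elsif current_size < entry.bytes_read
    :restart_from_start  # file shrank: inode likely recycled
  elsif current_size > entry.bytes_read
    :read_from_offset    # file grew: resume where we left off
  else
    :no_op               # nothing new to read
  end
end
```

The failure mode @LionelCons describes would be a recycled inode whose new file has already grown past `bytes_read`, which this heuristic cannot distinguish from ordinary growth; that is exactly the gap fingerprinting aims to close.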
@jordansissel AFAIK, here is the problem we see:
I have the same issue as @LionelCons. It seems to occur more with some filesystem settings than others: in my case a small LVM volume mounted at /var/log/xxx and formatted as ext4. XFS seems less subject to inode recycling.
@guyboertje I had put this on the back burner since we were pushing folks to use Filebeat, but in hindsight this was my mistake. I am open to moving forward on this. Thoughts?
This PR is way too big. I will chop it up into reasonable chunks.
Any update on this? Can't we just hook into some 'on file delete' event from the OS, so that on deletion of a file with inode xyz the entry xyz in the sincedb is also removed?
@guyboertje @jordansissel any plan to merge this? We are running into real issues as the number of files has grown a lot since we started, and this is causing loss of data.
@ashishpok See @guyboertje's comment: #79 (comment)
@guyboertje @jordansissel any update? There are some issues in Logstash which could be fixed by this PR.
Closing this. A simpler, non-fingerprinting version is in the works.
@guyboertje Could you please tell us more about this "simpler non-fingerprinting version in the works"?
No description provided.