Add versioned sincedb upgrade; use fingerprints instead of inode and/or path #79


Closed
wants to merge 17 commits into from

Conversation

guyboertje
Contributor

No description provided.

@guyboertje guyboertje changed the title WIP add versioned sincedb upgrade Add versioned sincedb upgrade Mar 16, 2016
@guyboertje
Contributor Author

If we introduce a fingerprint instead of the path as the sincedb key, we rely on two inherent properties of a file: its address (inode) and a hash of some of its content - both of which are not user adjustable.

So in comparing two files we have this matrix:

   |   3   |   4   |
x  | A   B |   C   |
y  |   D   |   E   |

3,4 = inode
x,y = fingerprint
A,B,C,D,E = files

same:                  A <-> B (cell - both same)
different:             A <-> E, C <-> D (diagonals)
different:             A <-> D, C <-> E (columns - inode same, fp different)
same:                  A <-> C, D <-> E (rows - fp same, inode different)

different(collision)?: A <-> C, D <-> E (rows - fp same, inode different)

@colinsurprenant says "if we have a sufficiently robust hashing strategy that reduces collisions significantly we can assert that two files with the same fingerprint on different inodes are the same file."

Therefore the next discussion is what the robust hashing strategy should be:

  • SHA256 hash of (up to) X bytes
  • Bloombroom::FNVFFI.fnv1a_64 hash of X bytes

where X is 512 bytes, one disk block, etc.

We need to think of the case where a tailed file is discovered with very little content - will the chance of collisions increase?
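For illustration, here is a minimal sketch of this kind of content fingerprint, assuming we hash up to the first 255 bytes with SHA256 from Ruby's standard library; the method name and exact read size are illustrative, not the plugin's actual API:

```ruby
require "digest"

FP_BYTE_SIZE = 255 # illustrative: hash at most this many bytes

# Compute a content fingerprint for +path+: a digest of up to FP_BYTE_SIZE
# bytes starting at +offset+, plus the offset and the number of bytes
# actually hashed (short files yield shorter lengths).
def content_fingerprint(path, offset = 0)
  data = File.open(path, "rb") do |io|
    io.seek(offset)
    io.read(FP_BYTE_SIZE) || ""
  end
  [Digest::SHA256.hexdigest(data), offset, data.bytesize]
end
```

Two files on different inodes that produce the same [digest, offset, length] triple would then be treated as the same file, which is exactly where the collision question above matters for files with very little content.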

@jordansissel
Owner

This is a hard problem. I wish I had better answers at this time :\

That said, I wonder if we can get away with just using a content fingerprint and ignoring addresses. If we assume that log files include timestamps -- so that, even for the same "log message", the timestamp content will differ between two different log files -- maybe we don't even need addresses? Given @colinsurprenant's hypothetical of a 'sufficiently robust hashing strategy', I have a feeling that we could use the content hash and ignore addresses (file path, inode, device).

Thoughts? It would eliminate the column portion of the matrix and just have rows (content fingerprint).

@guyboertje
Contributor Author

👍 on fingerprints.

In Slack, @jsvd, @jordansissel and I concluded the following:

  • have a second checksum of the last block read.
  • any empty files, or files with only a header (e.g. CSV) and no content, would have the same checksum initially but would get new ones as more content is read.
  • the checksums stored on disk (in the sincedb file) would have an offset and length, so the fingerprint would be [checksum1, offset1, length1, checksum2, offset2, length2].

The upgrader will try to match inodes from an old sincedb with discovered files. We need to leave unmatched sincedb records in the new sincedb file as they may be re-discovered later.
Each sincedb record, old and new, will have an expiry timestamp added of, say, 14 days. Expired entries are removed. Before watched_files are written to disk, a new expiry timestamp is added to those that have had more data read this session than the position stored in the hash value; otherwise the old expiry timestamp is written.
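As an illustration of the record shape being described, here is a rough sketch assuming plain Structs for the key and value; the class and field names are assumptions for this sketch, not the PR's actual classes, and the expiry period follows the "say 14 days" above rather than the committed constant:

```ruby
SDB_EXPIRES_DAYS = 14 # per the discussion above; the committed constant may differ

# Illustrative in-memory shape: the key is the fingerprint
# [checksum1, offset1, length1, checksum2, offset2, length2] and the value
# carries the bytes-read position plus an expiry timestamp.
SincedbKey   = Struct.new(:checksum1, :offset1, :length1,
                          :checksum2, :offset2, :length2)
SincedbValue = Struct.new(:position, :expires_at) do
  def expired?(now = Time.now)
    now > expires_at
  end

  def renew(now = Time.now)
    self.expires_at = now + (SDB_EXPIRES_DAYS * 24 * 3600)
  end
end

# Example: a record for a file we have read 42 bytes of.
sincedb = {}
sincedb[SincedbKey.new(245697582, 0, 42)] = SincedbValue.new(42).tap(&:renew)
```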

Example 1: imagine we have previously read a file with 42 bytes; on disk we have one sincedb entry of 245697582,0,42 42. Then we discover the same file but now it has more bytes, with a checksum of 282385235,0,255. There is no direct match on checksums, so we have to find one that might match. We extract a list of possible sincedb keys (checksums) that are less than 255 in length and have not been matched before, and create and test checksums from the discovered file for the lengths in the list of possibles. If we have a match we add 282385235,0,255 => 42 to the sincedb hash and remove the 245697582,0,42 42 entry, meaning that we will start reading the discovered file from byte 43.

Example 2: we discover a small file of less than 255 bytes, e.g. 532356864,0,120. If there is no direct match we only need to extract possible keys that are less than 120 in length. We repeat the same test as above.
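A sketch of the rematch search from Examples 1 and 2, assuming the sincedb is a Hash keyed by [checksum, offset, length] arrays and a hypothetical checksum_of(path, offset, length) helper that hashes that slice of the file; the "not matched before" bookkeeping is omitted here:

```ruby
# Given a discovered file whose full fingerprint length is +discovered_length+,
# look for an older, shorter entry whose checksum matches a checksum of the
# discovered file recomputed at that entry's (shorter) length.
def find_rematch(sincedb, path, discovered_length)
  candidates = sincedb.keys.select do |(_, offset, length)|
    offset.zero? && length < discovered_length
  end
  candidates.find do |(old_checksum, offset, length)|
    checksum_of(path, offset, length) == old_checksum
  end
end

# On a hit we would migrate the stored position to the new key and resume:
#   position = sincedb.delete(old_key)
#   sincedb[new_key] = position
```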

When we discover a file bigger than 32K we can pre-compute two checksums, one at 0,255 and the other at 32769,255 (or less if the file size is less than 32K + 255).

When we discover a file smaller than 32K we pre-compute only one checksum at 0,255 (or less if the file size is less than 255).

Example 3: we discover a very large file with a checksum of 287775235,0,255. We have seen and read 44444 bytes from it before and have this on disk: 287775235,0,255 44444 457775898,32769,255. The first checksum matches, and we double check whether the second part matches too; if so, we know it is the same file and read from byte 44445. If there is no match then we treat it as a new file. When there is a first match but not a second match, and the sincedb second checksum length was less than 255, we compute a checksum for that length on the discovered file.

Example 4: we discover a file that is 32768 + 10 bytes. Its checksums are at 0,255 and 32768,10.

At file discovery we will cache either the bytes at 0,255 or 32768,255 to make re-computing the checksums quicker.
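A sketch of the pre-compute rule from the last few paragraphs, reusing the hypothetical checksum_of helper and the 32K/255 sizes mentioned above; this illustrates the idea, not the PR's code:

```ruby
FILE_READ_SIZE = 32768 # the 32K boundary used for the second checksum
FP_BYTE_SIZE   = 255

# Returns one or two [checksum, offset, length] triples for a discovered file:
# always one at the head of the file, and a second one past the 32K boundary
# when the file has bytes there (shorter if fewer than 255 bytes remain).
def fingerprints_for(path)
  size     = File.size(path)
  head_len = [size, FP_BYTE_SIZE].min
  fps      = [[checksum_of(path, 0, head_len), 0, head_len]]
  if size > FILE_READ_SIZE
    tail_len = [size - FILE_READ_SIZE, FP_BYTE_SIZE].min
    fps << [checksum_of(path, FILE_READ_SIZE, tail_len), FILE_READ_SIZE, tail_len]
  end
  fps
end
```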

@gmoskovicz

@guyboertje @suyograo do we have a new estimate when this will be merged?

😃

Thanks!!

@guyboertje guyboertje changed the title Add versioned sincedb upgrade Add versioned sincedb upgrade; use fingerprints instead of inode and/or path Apr 7, 2016
FP_BYTE_SIZE = 255
FILE_READ_SIZE = 32768
SDB_EXPIRES_DAYS = 10
FIXNUM_MAX = (2**(0.size * 8 - 2) - 1)
Collaborator


Shame that there is no constant in the Ruby world for that.
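For context on the constant under review: Ruby core does not define a maximum-Fixnum constant, so the diff derives one from the machine word size. A quick illustration of what the expression evaluates to on a 64-bit build:

```ruby
# 0.size is the number of bytes in a machine word (8 on a 64-bit build),
# so the expression evaluates to 2**62 - 1, the largest value Ruby keeps
# as an immediate Fixnum.
FIXNUM_MAX = (2**(0.size * 8 - 2) - 1)
FIXNUM_MAX # => 4611686018427387903 on 64-bit
```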

@@ -1,31 +1,59 @@
require "filewatch/buftok"
require 'filewatch/boot_setup' unless defined?(FileWatch)
Collaborator


# encoding: utf-8

@ph
Collaborator

ph commented Apr 11, 2016

Really awesome work @guyboertje ! I like reading the code changes.

Before moving forward I suggest we do the following.

  • Create a .travis.yml to test with JRuby and MRI.
  • Create an AppVeyor config for Windows; I believe we can use JRuby with them?

In this PR I see a lot of changes concerning the actual test suite and they are mostly integration tests; I think we should try to add unit tests to all the newly added classes and the ones that have changed a lot.

@guyboertje guyboertje assigned ph and jsvd Jun 8, 2016
@guyboertje
Contributor Author

Any chance of a second round of reviews?

@LionelCons

Any progress on this one?

One year has passed and we still lose data because of inodes staying in the sincedb forever...

@jordansissel
Owner

@LionelCons this library could ignore files if the inode changes but the file size does not change. For logs, log files typically grow to a certain (large) size and then are never written to again. If an inode is reused, this means a new log file was created (it starts with zero size and grows), so this library should see that the file size shrank compared to the old inode information and should start over reading that file.
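To make that intended behaviour concrete, here is a rough sketch of the shrink check, assuming we track the last known size per inode; this is an illustration of the idea, not the library's actual code:

```ruby
# last_sizes maps an inode key to the size we last saw for that inode.
# If the current size is smaller, assume the inode was reused by a new
# file and start reading it from the beginning.
def check_inode(last_sizes, inode, current_size)
  previous = last_sizes[inode]
  last_sizes[inode] = current_size
  if previous && current_size < previous
    :reread_from_start   # inode reused: a new, smaller file took this inode
  else
    :continue            # same file (or first sighting): keep tailing
  end
end
```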

Can you open a new issue that describes your scenario (how you are writing to these files, how often inodes are reused, what stat reports for the old file vs the new file, etc.)?

@LionelCons

@jordansissel AFAIK, here is the problem we see:

  • the file grows a bit (size s1) and gets rotated
  • the file eventually gets removed/unlinked but stays in sincedb (since there is no "purge")
  • the inode gets reused, sometimes with a completely different log (maybe getting much more data)
  • the new log file gets some data (size s2 > s1) before the library detects the rotation
  • the library thinks it is the same file and it gives truncated information

@gquintana

I have the same issue as @LionelCons.

It seems to occur more on some filesystem setups than others: in my case, a small LVM volume mounted in /var/log/xxx and formatted as Ext4. XFS seems less subject to inode recycling.

@jordansissel
Owner

@guyboertje I had put this on the backburner since we were pushing folks to use filebeat, but in hindsight, this was my mistake. I am open to moving forward on this. Thoughts?

@guyboertje
Contributor Author

This PR is way too big. I will chop it up into reasonable chunks.

@johanvanderkuijl

Any update on this?
We also experience the reuse of inodes, so this leads to unexpected behaviour.

Can't we just hook into some 'on file delete' event from the OS, so that on deletion of a file with inode xyz the entry for xyz in the sincedb is also removed?

@ashishpok

ashishpok commented Nov 3, 2017

@guyboertje @jordansissel any plan of merging this? We are running into real issues as the number of files has grown a lot since we started, and this is causing loss of data.

@jordansissel
Owner

any plan of merging this?

@ashishpok See @guyboertje's comment: #79 (comment)

@sevdog

sevdog commented Feb 16, 2018

2 Aug 2017
This PR is way too big. I will chop it up into reasonable chunks.

@guyboertje @jordansissel any update? Because there are some issues on Logstash which could be fixed by this PR.

@guyboertje
Contributor Author

Closing this. Simpler non-fingerprinting version in the works.

@guyboertje guyboertje closed this Mar 8, 2018
@LionelCons

@guyboertje Could you please tell us more about this "simpler non-fingerprinting version in the works"?
