Add versioned sincedb upgrade; use fingerprints instead of inode and/or path #79


Closed
wants to merge 17 commits into from

Conversation

guyboertje
Contributor

No description provided.

@guyboertje guyboertje changed the title WIP add versioned sincedb upgrade Add versioned sincedb upgrade Mar 16, 2016
@guyboertje
Contributor Author

If we introduce a fingerprint instead of the path as the sincedb key, we rely on two inherent properties of a file: its address (inode) and a hash of some of its content - both of which are not user adjustable.

So in comparing two files we have this matrix:

   |   3   |   4   |
x  | A   B |   C   |
y  |   D   |   E   |

3,4 = inode
x,y = fingerprint
A,B,C,D,E = files

same:                  A <-> B (cell - both same)
different:             A <-> E, C <-> D (diagonals)
different:             A <-> D, C <-> E (columns - inode same, fp different)
same:                  A <-> C, D <-> E (rows - fp same, inode different)

different(collision)?: A <-> C, D <-> E (rows - fp same, inode different)

@colinsurprenant says "if we have a sufficiently robust hashing strategy that reduces collisions significantly we can assert that two files with the same fingerprint on different inodes are the same file."

Therefore the next discussion is what the robust hashing strategy should be:

  • SHA256 hash of (up to) X bytes
  • Bloombroom::FNVFFI.fnv1a_64 hash of X bytes

where X is 512 bytes, one disk block, etc.

We need to think of the case where a tailed file is discovered with very little content - will the chance of collisions increase?
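For illustration, here is a minimal sketch of this kind of content fingerprint, assuming we hash up to the first 255 bytes with SHA256 from Ruby's standard library; the method name and exact read size are illustrative, not the plugin's actual API:

```ruby
require "digest"

FP_BYTE_SIZE = 255 # illustrative: hash at most this many bytes

# Compute a content fingerprint for +path+: a digest of up to FP_BYTE_SIZE
# bytes starting at +offset+, plus the offset and the number of bytes
# actually hashed (short files yield shorter lengths).
def content_fingerprint(path, offset = 0)
  data = File.open(path, "rb") do |io|
    io.seek(offset)
    io.read(FP_BYTE_SIZE) || ""
  end
  [Digest::SHA256.hexdigest(data), offset, data.bytesize]
end
```

Two files on different inodes that produce the same [digest, offset, length] triple would then be treated as the same file, which is exactly where the collision question above matters for files with very little content.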

@jordansissel
Owner

This is a hard problem. I wish I had better answers at this time :\

That said, I wonder if we can get away with just using a content fingerprint and ignoring addresses. If we assume that log files include timestamps -- so that, even for the same "log message", the timestamp content will differ between two different log files -- maybe we don't even need addresses? Given @colinsurprenant's hypothetical of a 'sufficiently robust hashing strategy', I have a feeling that we could use the content hash and ignore addresses (file path, inode, device).

Thoughts? It would eliminate the column portion of the matrix and just have rows (content fingerprint).

@guyboertje
Contributor Author

👍 on fingerprints.

In Slack, @jsvd, @jordansissel and I concluded the following:

  • have a second checksum of the last block read.
  • any empty files, or files with only a header (e.g. CSV) and no content, would have the same checksum initially but would get new ones as more content is read.
  • the checksums stored on disk (in the sincedb file) would have an offset and length, so the fingerprint would be [checksum1, offset1, length1, checksum2, offset2, length2].

The upgrader will try to match inodes from an old sincedb with discovered files. We need to leave unmatched sincedb records in the new sincedb file as they may be re-discovered later.
Each sincedb record, old and new, will have an expiry timestamp added of, say, 14 days. Expired entries are removed. Before watched_files are written to disk, a new expiry timestamp is added to those that have had more data read this session than the position stored in the hash value; otherwise the old expiry timestamp is written.
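As an illustration of the record shape being described, here is a rough sketch assuming plain Structs for the key and value; the class and field names are assumptions for this sketch, not the PR's actual classes, and the expiry period follows the "say 14 days" above rather than the committed constant:

```ruby
SDB_EXPIRES_DAYS = 14 # per the discussion above; the committed constant may differ

# Illustrative in-memory shape: the key is the fingerprint
# [checksum1, offset1, length1, checksum2, offset2, length2] and the value
# carries the bytes-read position plus an expiry timestamp.
SincedbKey   = Struct.new(:checksum1, :offset1, :length1,
                          :checksum2, :offset2, :length2)
SincedbValue = Struct.new(:position, :expires_at) do
  def expired?(now = Time.now)
    now > expires_at
  end

  def renew(now = Time.now)
    self.expires_at = now + (SDB_EXPIRES_DAYS * 24 * 3600)
  end
end

# Example: a record for a file we have read 42 bytes of.
sincedb = {}
sincedb[SincedbKey.new(245697582, 0, 42)] = SincedbValue.new(42).tap(&:renew)
```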

Example 1: imagine we have previously read a file with 42 bytes; on disk we have one sincedb entry of 245697582,0,42 42. Then we discover the same file but now it has more bytes, with a checksum of 282385235,0,255. There is no direct match on checksums, so we have to find one that might match. We extract a list of possible sincedb keys (checksums) that are less than 255 in length and have not been matched before, and create and test checksums from the discovered file for the lengths in the list of possibles. If we have a match we add 282385235,0,255 => 42 to the sincedb hash and remove the 245697582,0,42 42 entry, meaning that we will start reading the discovered file from byte 43.

Example 2: we discover a small file of less than 255 bytes, e.g. 532356864,0,120. If there is no direct match we only need to extract possible keys that are less than 120 in length. We repeat the same test as above.
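A sketch of the rematch search from Examples 1 and 2, assuming the sincedb is a Hash keyed by [checksum, offset, length] arrays and a hypothetical checksum_of(path, offset, length) helper that hashes that slice of the file; the "not matched before" bookkeeping is omitted here:

```ruby
# Given a discovered file whose full fingerprint length is +discovered_length+,
# look for an older, shorter entry whose checksum matches a checksum of the
# discovered file recomputed at that entry's (shorter) length.
def find_rematch(sincedb, path, discovered_length)
  candidates = sincedb.keys.select do |(_, offset, length)|
    offset.zero? && length < discovered_length
  end
  candidates.find do |(old_checksum, offset, length)|
    checksum_of(path, offset, length) == old_checksum
  end
end

# On a hit we would migrate the stored position to the new key and resume:
#   position = sincedb.delete(old_key)
#   sincedb[new_key] = position
```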

When we discover a file bigger than 32K we can pre-compute two checksums, one at 0,255 and the other at 32769,255 (or less if the file size is less than 32K + 255).

When we discover a file smaller than 32K we pre-compute only one checksum at 0,255 (or less if the file size is less than 255).

Example 3: we discover a very large file with a checksum of 287775235,0,255. We have seen and read 44444 bytes from it before and have this on disk: 287775235,0,255 44444 457775898,32769,255. The first checksum matches, and we double check whether the second part matches too; if so, we know it is the same file and read from byte 44445. If there is no match then we treat it as a new file. When there is a first match but not a second match, and the sincedb second checksum length was less than 255, we compute a checksum for that length on the discovered file.

Example 4: we discover a file that is 32768 + 10 bytes. Its checksums are at 0,255 and 32768,10.

At file discovery we will cache either the bytes at 0,255 or 32768,255 to make re-computing the checksums quicker.
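A sketch of the pre-compute rule from the last few paragraphs, reusing the hypothetical checksum_of helper and the 32K/255 sizes mentioned above; this illustrates the idea, not the PR's code:

```ruby
FILE_READ_SIZE = 32768 # the 32K boundary used for the second checksum
FP_BYTE_SIZE   = 255

# Returns one or two [checksum, offset, length] triples for a discovered file:
# always one at the head of the file, and a second one past the 32K boundary
# when the file has bytes there (shorter if fewer than 255 bytes remain).
def fingerprints_for(path)
  size     = File.size(path)
  head_len = [size, FP_BYTE_SIZE].min
  fps      = [[checksum_of(path, 0, head_len), 0, head_len]]
  if size > FILE_READ_SIZE
    tail_len = [size - FILE_READ_SIZE, FP_BYTE_SIZE].min
    fps << [checksum_of(path, FILE_READ_SIZE, tail_len), FILE_READ_SIZE, tail_len]
  end
  fps
end
```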

@gmoskovicz

@guyboertje @suyograo do we have a new estimate when this will be merged?

😃

Thanks!!

@guyboertje guyboertje changed the title Add versioned sincedb upgrade Add versioned sincedb upgrade; use fingerprints instead of inode and/or path Apr 7, 2016
FP_BYTE_SIZE = 255
FILE_READ_SIZE = 32768
SDB_EXPIRES_DAYS = 10
FIXNUM_MAX = (2**(0.size * 8 - 2) - 1)
Collaborator


Shame that there is no constant in the Ruby world for that.
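For context on the constant under review: Ruby core does not define a maximum-Fixnum constant, so the diff derives one from the machine word size. A quick illustration of what the expression evaluates to on a 64-bit build:

```ruby
# 0.size is the number of bytes in a machine word (8 on a 64-bit build),
# so the expression evaluates to 2**62 - 1, the largest value Ruby keeps
# as an immediate Fixnum.
FIXNUM_MAX = (2**(0.size * 8 - 2) - 1)
FIXNUM_MAX # => 4611686018427387903 on 64-bit
```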

@@ -1,31 +1,59 @@
require "filewatch/buftok"
require 'filewatch/boot_setup' unless defined?(FileWatch)
Collaborator


# encoding: utf-8

@ph
Collaborator

ph commented Apr 11, 2016

Really awesome work @guyboertje ! I like reading the code changes.

Before moving forward I suggest we do the following.

  • Create a .travis.yml to test with JRuby and MRI.
  • Create an AppVeyor config for Windows; I believe we can use JRuby with them?

In this PR I see a lot of changes concerning the actual test suite and they are mostly integration tests; I think we should try to add unit tests to all the newly added classes and the ones that have changed a lot.

@guyboertje guyboertje assigned ph and jsvd Jun 8, 2016
@guyboertje
Contributor Author

Any chance of a second round of reviews?

@LionelCons

Any progress on this one?

One year has passed and we still lose data because of inodes staying in the sincedb forever...

@jordansissel
Owner

@LionelCons this library could ignore files if the inode changes but the file size does not change. For logs, log files typically grow to a certain (large) size and then are never written to again. If an inode is reused, this means a new log file was created (it starts with zero size and grows), so this library should see that the file size shrank compared to the old inode information and should start over reading that file.
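To make that intended behaviour concrete, here is a rough sketch of the shrink check, assuming we track the last known size per inode; this is an illustration of the idea, not the library's actual code:

```ruby
# last_sizes maps an inode key to the size we last saw for that inode.
# If the current size is smaller, assume the inode was reused by a new
# file and start reading it from the beginning.
def check_inode(last_sizes, inode, current_size)
  previous = last_sizes[inode]
  last_sizes[inode] = current_size
  if previous && current_size < previous
    :reread_from_start   # inode reused: a new, smaller file took this inode
  else
    :continue            # same file (or first sighting): keep tailing
  end
end
```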

Can you open a new issue that describes your scenario (how you are writing to these files, how often inodes are reused, what stat reports for the old file vs the new file, etc.)?

@LionelCons

@jordansissel AFAIK, here is the problem we see:

  • the file grows a bit (size s1) and gets rotated
  • the file eventually gets removed/unlinked but stays in sincedb (since there is no "purge")
  • the inode gets reused, sometimes with a completely different log (maybe getting much more data)
  • the new log file gets some data (size s2 > s1) before the library detects the rotation
  • the library thinks it is the same file and it gives truncated information

@gquintana

I have the same issue as @LionelCons.

It seems to occur more on some filesystem setups than others: in my case, a small LVM volume mounted in /var/log/xxx and formatted as Ext4. XFS seems less subject to inode recycling.

@jordansissel
Owner

@guyboertje I had put this on the backburner since we were pushing folks to use filebeat, but in hindsight, this was my mistake. I am open to moving forward on this. Thoughts?

@guyboertje
Contributor Author

This PR is way too big. I will chop it up into reasonable chunks.

@johanvanderkuijl

Any update on this?
We also experience the reuse of inodes, so this leads to unexpected behaviour.

Can't we just hook into some 'on file delete' event from the OS, so that on deletion of a file with inode xyz the entry for xyz in the sincedb is also removed?

@ashishpok

ashishpok commented Nov 3, 2017

@guyboertje @jordansissel any plan of merging this? We are running into real issues as the number of files has grown a lot since we started, and this is causing loss of data.

@jordansissel
Owner

any plan of merging this?

@ashishpok See @guyboertje's comment: #79 (comment)

@sevdog

sevdog commented Feb 16, 2018

2 Aug 2017
This PR is way too big. I will chop it up into reasonable chunks.

@guyboertje @jordansissel any update? Because there are some issues on Logstash which could be fixed by this PR.

@guyboertje
Contributor Author

Closing this. Simpler non-fingerprinting version in the works.

@guyboertje guyboertje closed this Mar 8, 2018
@LionelCons

@guyboertje Could you please tell us more about this "simpler non-fingerprinting version in the works"?
