
Fix ReadFile handler to consider the value stored in sincedb on plugin restart #307


Conversation

Contributor

@andsel commented Apr 1, 2022

Release notes

Fixes read mode when the sincedb already stores a reference to a file that was not completely consumed.

What does this PR do?

Updates the file pointer of a file in read mode to the maximum of the bytes already read and the sincedb position recorded for the same file.
This means that when a pipeline is restarted, it can recover from the last known position instead of restarting from the beginning and reprocessing lines that were already processed.
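
In outline, the resume position is computed as follows (a minimal Ruby sketch with assumed names, not the plugin's exact API; the real change is shown in the review diff further down):

# Choose the offset to seek to when a read-mode file is (re)opened.
# `bytes_read` is what the current run has already consumed from the file,
# `sincedb_position` is the checkpoint persisted by a previous run (nil if unknown).
def resume_position(bytes_read, sincedb_position)
  [bytes_read, sincedb_position || 0].max
end

resume_position(0, 1024)    # fresh restart, sincedb remembers 1024 => resume at 1024
resume_position(2048, 1024) # current run is already ahead         => stay at 2048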

Why is it important/What is the impact to the user?

When a pipeline with a file input in read mode is restarted, this lets the plugin resume from where it left off, provided that information is present in the sincedb store.

Checklist

  • My code follows the style guidelines of this project
  • I have commented my code, particularly in hard-to-understand areas
  • [ ] I have made corresponding changes to the documentation
  • [ ] I have made corresponding change to the default configuration files (and/or docker env variables)
  • I have added tests that prove my fix is effective or that my feature works

Author's Checklist

Pipeline definition:

- pipeline.id: SDH_650
  pipeline.workers: 1
  pipeline.batch.size: 5
  config.string: |
    input {
        file {
            path => "/home/andrea/workspace/logstash_configs/file_input_sdh650/sample_fixture.csv"
            sincedb_path => "/home/andrea/workspace/logstash_configs/file_input_sdh650/sincedb"
            mode  => "read"
            start_position => "beginning"
        }
    }

    filter {
        csv {
            separator => ","
            columns => ["id", "host", "fqdn", "IP", "mac", "role", "type", "make", "model", "oid", "fid", "time"]
            remove_field => ["path", "host", "message", "@version" ]   
        }
        sleep {
            time => 1
            every => 10
        }
    }

    output {
        elasticsearch { 
            index => "650" 
            hosts => "http://localhost:9200"
            user => "elastic"
            password => "changeme"
        }
        stdout { codec => dots }
    }

Some curl requests to configure the ES output index, plus an aggregation query to verify the result:

PUT /650
{
  "mappings": {
    "properties": {
      "id":    { "type": "keyword" },  
      "host":  { "type": "text"  }, 
      "fqdn":   { "type": "text"  },
      "IP":   { "type": "text"  },
      "mac":   { "type": "text"  },
      "role":   { "type": "keyword"  },
      "type":   { "type": "keyword"  },
      "make":   { "type": "text"  },
      "model":   { "type": "text"  },
      "oid":   { "type": "text"  },
      "fid":   { "type": "text"  },
      "time":   { "type": "text"  }
    }
  }
}
DELETE 650

GET 650/_search
{
  "aggs": {
    "types": {
      "terms": { "field": "type" }
    }
  }
}

The expectation is to have 2 equally sized buckets. Without the fix, one bucket contains more documents, which means some rows were reprocessed on the pipeline reload.
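
If you prefer an automated check, something like the following works (a sketch using only Ruby's standard library; index name and credentials are the ones from the pipeline above):

require "net/http"
require "json"
require "uri"

# Run the terms aggregation on the "650" index and check that the
# two "type" buckets contain the same number of documents.
uri = URI("http://localhost:9200/650/_search?size=0")
req = Net::HTTP::Post.new(uri, "Content-Type" => "application/json")
req.basic_auth("elastic", "changeme")
req.body = { aggs: { types: { terms: { field: "type" } } } }.to_json

res = Net::HTTP.start(uri.hostname, uri.port) { |http| http.request(req) }
buckets = JSON.parse(res.body).dig("aggregations", "types", "buckets")
counts = buckets.map { |b| b["doc_count"] }

puts counts.uniq.length == 1 ? "OK: buckets are equally sized (#{counts.inspect})" : "FAIL: #{counts.inspect}"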

How to test this PR locally

Follow the steps in #290.

Related issues

Use cases

Screenshots

Logs

@andsel added the bug label Apr 1, 2022
@andsel changed the title from "Added test to verify that ReadFile handler doesn't consider the value stored in sincedb" to "Fix ReadFile handler to consider the value stored in sincedb on plugin restart" Apr 7, 2022
@kaisecheng self-requested a review April 11, 2022 11:40
Comment on lines 6 to 18
if open_file(watched_file)
  add_or_update_sincedb_collection(watched_file) unless sincedb_collection.member?(watched_file.sincedb_key)
  if sincedb_collection.member?(watched_file.sincedb_key)
    previous_pos = sincedb_collection.find(watched_file).position
    watched_file.file_seek([watched_file.bytes_read, previous_pos].max)
  end
  loop do
Member

Since add_or_update_sincedb_collection ensures the file is in the sincedb and sets the correct position from the sincedb to the watched_file, can't we simplify this a lot to just be:

Suggested change
- if open_file(watched_file)
-   add_or_update_sincedb_collection(watched_file) unless sincedb_collection.member?(watched_file.sincedb_key)
-   if sincedb_collection.member?(watched_file.sincedb_key)
-     previous_pos = sincedb_collection.find(watched_file).position
-     watched_file.file_seek([watched_file.bytes_read, previous_pos].max)
-   end
-   loop do
+ if open_file(watched_file)
+   add_or_update_sincedb_collection(watched_file)
+   watched_file.file_seek(watched_file.bytes_read)
+   loop do

Contributor Author

@jsvd good point! The semantics remain the same and the flow is simplified.

Contributor Author

It fails the test https://app.travis-ci.com/github/logstash-plugins/logstash-input-file/jobs/566809162#L974 because, without the max(watched_file.bytes_read, sincedb previous_pos), it seems that bytes_read doesn't have the same value as position. Need to investigate why.

Contributor Author

@andsel Apr 12, 2022

Without the guard

add_or_update_sincedb_collection(watched_file) unless sincedb_collection.member?

it happens that watched_file.bytes_read is always updated to sincedb_collection.find(watched_file).position.
That position comes from the sinceDB and is the last recorded checkpoint.

What happens in this test? The objective of the test is to verify striped reads from a couple of files:
file1

string1\nstring2

file2

stringA\nstringB

Striped reads means that one line is read from each file in turn, so the position points just past the \n character.
The test fails because the file_chunk_size is 10 bytes, so on the first pass over the first file it reads 10 bytes and puts them in a buffer, which is:

string1\nst

If seek sets the file pointer to the sincedb position (8, just after the \n), then the next chunk grabbed from the file is

string2

which goes into the buffer that still contained st, creating the string ststring2.
This is the reason to have the max(bytes_read, last_pos).
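
A standalone way to see this failure mode (a sketch with simplified buffering, not the plugin's actual reader):

# Simulate chunked reads of file1 ("string1\nstring2") with a 10-byte chunk.
content = "string1\nstring2"
chunk_size = 10

# First chunk: bytes 0..9 => "string1\nst". The full line "string1" is
# extracted, leaving "st" in the buffer; bytes_read is 10 while the
# sincedb position only advances to 8 (just past the "\n").
line, buffer = content[0, chunk_size].split("\n", 2)   # line = "string1", buffer = "st"
bytes_read = chunk_size
sincedb_position = line.bytesize + 1                   # => 8

# Seeking back to the sincedb position re-reads "string2" and appends it to
# the leftover "st", producing the corrupted line "ststring2".
corrupted = buffer + content[sincedb_position, chunk_size]                     # => "ststring2"

# Seeking to max(bytes_read, sincedb_position) continues cleanly instead.
correct   = buffer + content[[bytes_read, sincedb_position].max, chunk_size]   # => "string2"

puts corrupted  # ststring2
puts correct    # string2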

@kaisecheng left a comment

I have tested it locally. Both solutions work well to fix the repeated reads on restart. CI had been breaking for some time before this change.

@andsel requested a review from jsvd April 12, 2022 15:10

Contributor Author

@andsel commented Apr 27, 2022

Holding this PR until the CI is back to green on main; then I'll rebase and ask for review again.

@jsvd removed their request for review April 28, 2022 10:05
@andsel force-pushed the fix/read_mode_honor_sincedb_reference_after_a_restart branch from 8d32d95 to e4ce1b9 on May 2, 2022 07:20
@andsel requested a review from jsvd May 6, 2022 09:42
@andsel force-pushed the fix/read_mode_honor_sincedb_reference_after_a_restart branch from 7669002 to 85cf51d on June 6, 2022 08:42
@andsel requested a review from jsvd June 6, 2022 08:55
Member

@jsvd left a comment

LGTM

@andsel merged commit ef9b8d5 into logstash-plugins:main Jun 6, 2022

Successfully merging this pull request may close these issues.

unable to read the whole file when pipeline get reload