Fix ReadFile handler to consider the value stored in sincedb on plugin restart #307
    if open_file(watched_file)
      add_or_update_sincedb_collection(watched_file) unless sincedb_collection.member?(watched_file.sincedb_key)
      if sincedb_collection.member?(watched_file.sincedb_key)
        previous_pos = sincedb_collection.find(watched_file).position
        watched_file.file_seek([watched_file.bytes_read, previous_pos].max)
      end
      loop do
Since add_or_update_sincedb_collection ensures the file is in the sincedb and sets the correct position from the sincedb to the watched_file, can't we simplify this a lot to just be:
    if open_file(watched_file)
      add_or_update_sincedb_collection(watched_file)
      watched_file.file_seek(watched_file.bytes_read)
      loop do
@jsvd good point! The semantics remain the same and the flow is simplified.
It fails the test https://app.travis-ci.com/github/logstash-plugins/logstash-input-file/jobs/566809162#L974 because, without max(watched_file.bytes_read, previous_pos), it seems that bytes_read doesn't have the same value as the sincedb position. Need to investigate why.
Without the guard add_or_update_sincedb_collection(watched_file) unless sincedb_collection.member?, watched_file.bytes_read is always updated to sincedb_collection.find(watched_file).position.
This position comes from the sincedb and is the last recorded position.
What happens in this test? The objective of the test is to verify striped reads from a couple of files:
file1: string1\nstring2
file2: stringA\nstringB
Striped reads means that it reads one line from each file in turn, so the position points at the \n character.
The test fails because file_chunk_size is 10 bytes, so on the first run for the first file it reads 10 bytes and puts them in a buffer, which then contains string1\nst. If the seek sets the file pointer to the file position (8th byte), then the next chunk grabbed from the file is string2, which is appended to the buffer that still contains st, creating the string ststring2.
This is the reason to have max(bytes_read, last_pos).
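The failure described above can be reproduced outside the plugin with a small sketch. It is not the plugin's code: read_chunk is a hypothetical helper, StringIO stands in for the real file, and a trailing newline is added to the sample content so that the second line is emitted.

```ruby
require "stringio"

# Hypothetical helper (not part of the plugin): seek to seek_pos, read one
# chunk, append it to the buffer and return the complete lines found so far.
def read_chunk(io, seek_pos, buffer, chunk_size)
  io.seek(seek_pos)
  buffer << io.read(chunk_size).to_s
  lines = []
  while (idx = buffer.index("\n"))
    lines << buffer.slice!(0..idx).chomp
  end
  lines
end

content = "string1\nstring2\n" # trailing newline is an assumption
chunk_size = 10                # mirrors file_chunk_size in the test

io = StringIO.new(content)
buffer = +""

# First pass: 10 bytes are read, one line is emitted, "st" stays buffered.
first = read_chunk(io, 0, buffer, chunk_size)
bytes_read = chunk_size                    # bytes pulled from the file so far
sincedb_position = "string1\n".bytesize    # 8, end of the last emitted line

# Seeking back to the sincedb position re-reads "st" on top of the buffered "st".
bad_buffer = buffer.dup
bad = read_chunk(io, sincedb_position, bad_buffer, chunk_size)

# Seeking to the max continues exactly where the previous chunk ended.
good_buffer = buffer.dup
good = read_chunk(io, [bytes_read, sincedb_position].max, good_buffer, chunk_size)

puts first.inspect
puts bad.inspect
puts good.inspect
```

Here the sincedb position trails bytes_read because part of the chunk is still sitting in the buffer, which is why taking the larger of the two values avoids the duplicated prefix.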
I have tested it locally. Both solutions fix the repeated reads on restart. CI had already been broken for some time before this change.
Hold on this PR till the CI is back to green on |
8d32d95 to e4ce1b9
… stored in sincedb
Co-authored-by: João Duarte <[email protected]>
7669002 to 85cf51d
LGTM
Release notes
Fixes read mode when the sincedb already stores a position for a file that was not completely consumed.
What does this PR do?
Updates the file pointer of a read mode file to the maximum of the bytes already read and the sincedb position stored for the same file.
This solves a problem on pipeline restart: the plugin can now resume from the last known position instead of starting from the beginning and reprocessing lines that were already processed.
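The position choice can be sketched in isolation as follows. This is only an illustration under assumptions: FakeWatchedFile, FakeSincedbValue and resume_position are hypothetical names, and a plain Hash keyed by sincedb key stands in for the plugin's sincedb collection.

```ruby
# Hypothetical stand-ins for the plugin's objects (illustration only).
FakeWatchedFile = Struct.new(:sincedb_key, :bytes_read)
FakeSincedbValue = Struct.new(:position)

# Pick the offset to seek to: the larger of the bytes already read in this
# process and the position persisted in the sincedb, if an entry exists.
def resume_position(watched_file, sincedb)
  entry = sincedb[watched_file.sincedb_key]
  return watched_file.bytes_read if entry.nil?
  [watched_file.bytes_read, entry.position].max
end

sincedb = { "inode-1" => FakeSincedbValue.new(120) }

# After a restart bytes_read starts at 0, so the sincedb position wins.
restarted = FakeWatchedFile.new("inode-1", 0)
puts resume_position(restarted, sincedb)

# A file with no sincedb entry is read from the beginning.
unseen = FakeWatchedFile.new("inode-2", 0)
puts resume_position(unseen, sincedb)

# Mid-run, bytes_read can be ahead of the sincedb position and must win.
in_progress = FakeWatchedFile.new("inode-3", 10)
sincedb["inode-3"] = FakeSincedbValue.new(8)
puts resume_position(in_progress, sincedb)
```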
Why is it important/What is the impact to the user?
When a pipeline with a file input in read mode is restarted, this lets the plugin resume from where it left off, provided that information is present in the sincedb store.
Checklist
[ ] I have made corresponding changes to the documentation
[ ] I have made corresponding change to the default configuration files (and/or docker env variables)
Author's Checklist
sample_fixture.csv.txt
Pipeline definition:
Some curls to configure the ES output index and an aggregation query to verify:
The expectation is to have 2 equally sized buckets. Without the fix, one bucket contains more documents, which means some rows were reprocessed on a pipeline reload.
How to test this PR locally
Follow the steps in #290
Related issues
Use cases
Screenshots
Logs