Lost data on rotation #151

PhaedrusTheGreek opened this issue Jan 10, 2017 · 5 comments
@PhaedrusTheGreek

  • Version: LS 5.1.1
  • Operating System: RHEL 7.3
  • Steps to Reproduce:
  1. Create this file /etc/logrotate.d/test:
rotate 4
/var/log/test
{
        missingok
        size 1k
        notifempty
        sharedscripts
}
  2. Create script.sh to generate some data and rotate it:
for (( i=0 ; i <= 320000 ; i++ ))
do
 echo "$i this is a bunch of test data blah blah"  >> /var/log/test

 #slow it down a little
 if ! ((i % 10000)); then
  sleep 1
 fi

 if ! ((i % 100000)); then
   /usr/sbin/logrotate /etc/logrotate.d/test &
 fi
done
  3. touch /var/log/test


  4. Start Logstash with the following config:

input {
    file {
           path => "/var/log/test"
           sincedb_path => "/dev/null"
           start_position => "beginning"
           ignore_older => 0
           close_older => 5
           discover_interval => 1
    }
}

output {
 file {
   path => "/tmp/output"
   codec =>  line { format => "custom format: %{message}" }
 }
}
  5. Once Logstash is started, start script.sh and wait for it to finish.

  6. Compare the size of the logs with the ingested data:

[root]# cat test* | wc -l
320001
[root]# cat /tmp/output | wc -l
318503

*** note that Filebeat tested without data loss ***

filebeat.prospectors:
- input_type: log
  paths:
    - /var/log/test
output.file:
  path: "/tmp/filebeat"
  filename: filebeat
  number_of_files: 500
@LionelCons

Any news/progress on this one?

@jordansissel

This is likely due to the fact that Logstash will only look for new files when it reaches the end of the current file.

In this scenario, the script writes 100,000 records before each rotation. On my machine, this takes roughly 11 seconds between rotations (there's a sleep in your script every 10,000 iterations, so this makes sense).

This means Logstash needs to process at least 10,000 lines per second in order to keep up. If it cannot keep up, the new-file checks will stall.
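A quick back-of-the-envelope check of those numbers (using only the figures quoted above, nothing measured by Logstash itself):

```python
# Values from this comment: the script rotates every 100,000 lines,
# and rotations were observed roughly 11 seconds apart.
lines_per_rotation = 100_000
seconds_per_rotation = 11

required_rate = lines_per_rotation / seconds_per_rotation
print(round(required_rate))   # ~9091 lines/second just to break even
```

With processing overhead on top of the ten one-second sleeps per rotation, the break-even rate rounds up to the ~10,000 lines/second figure above.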

When I tested this, I noticed that Logstash would miss rotations, probably because it was running behind.

I ran the script to demonstrate this. Logstash had log.level: debug, and I saw this:

% tail -f logs/logstash-plain.log| grep 'old inode'
[2017-06-28T18:20:12,998][DEBUG][logstash.inputs.file     ] each: new inode: /tmp/test.log: old inode was ["1452461", 0, 36], new is ["1452470", 0, 36]
[2017-06-28T18:20:31,506][DEBUG][logstash.inputs.file     ] each: new inode: /tmp/test.log: old inode was ["1452470", 0, 36], new is ["1452472", 0, 36]


Only two new rotations were observed in the above run.

inodes for the logs:

% ls -i1 /tmp/test.log*
1452472 /tmp/test.log
1452471 /tmp/test.log.1
1452470 /tmp/test.log.2
1451500 /tmp/test.log.3
1452461 /tmp/test.log.4

The first entry in our log was this:

old inode was ["1452461", 0, 36], new is ["1452470", 0, 36]

This "old" is test.log.4, and the "new" is test.log.2. You can see test.log.3 was missed here.

The second entry:

old inode was ["1452470", 0, 36], new is ["1452472", 0, 36]

The old here is test.log.2, and the new is test.log. test.log.1 was missed.
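Those two log entries can be illustrated with a small sketch (hypothetical, not the actual file-input code): if rotation is detected by comparing the inode of the watched path between polls, any inode that rotates in and out between two polls is simply never observed.

```python
# Hypothetical sketch of poll-based rotation detection, as suggested by the
# "old inode was ..., new is ..." log lines above.
def rotations_seen(polled_inodes):
    """polled_inodes: inode of the watched path at each poll.
    Returns the (old, new) inode changes actually observed."""
    changes = []
    last = polled_inodes[0]
    for ino in polled_inodes[1:]:
        if ino != last:
            changes.append((last, ino))
            last = ino
    return changes

# Inodes from the listing above: the path rotated through five inodes, but
# only two changes were caught because polls were slower than rotations.
print(rotations_seen([1452461, 1452470, 1452472]))
# -> [(1452461, 1452470), (1452470, 1452472)]
```

test.log.3 (inode 1451500) and test.log.1 (inode 1452471) never appear in the polls, which matches the two misses described above.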


If we ask Logstash how fast it's going using the stats API, we can see it processing (at least on my laptop, the test environment) more slowly than script.sh emits lines into the log file:

% while true; do curl -s localhost:9600/_node/stats?pretty | jq .pipeline.events.out; sleep 1; done | awk 'a { print $1 - a } { a = $1 }'
8413
9142
5127
4554
9436
9127
8927
9056
9065
8434

The above numbers are roughly events per second. script.sh writes roughly 10,000 events per second and rotates every 100,000 events; Logstash is not keeping up on my workstation and misses rotations because of the way the file input is implemented.
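Averaging the sampled deltas above makes the deficit concrete (hypothetical arithmetic; the 10,000/s write rate is the figure from earlier in this comment):

```python
# Per-second event deltas sampled from the stats API above.
samples = [8413, 9142, 5127, 4554, 9436, 9127, 8927, 9056, 9065, 8434]

avg = sum(samples) / len(samples)
write_rate = 10_000   # approximate rate at which script.sh produces lines
print(f"avg consume rate: {avg:.0f}/s, deficit: {write_rate - avg:.0f}/s")
```

At an average of roughly 8,100 events per second against a ~10,000/s producer, the reader falls further behind with every rotation.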

We can try to improve the implementation problem (file discovery happens only after file reads), but ultimately Logstash is processing more slowly than you are producing events. Even if we move file discovery outside the main file-input thread, Logstash would still be consuming more slowly than you are producing. That means your log rotation will not really save you space: while logrotate might delete test.log.4, Logstash will still be reading from it (keeping it from being freed) until it hits EOF and the close_older timeout has elapsed. This may or may not be a problem for you, depending on how far behind Logstash is.

I would note that Logstash should be able to read files faster than this, but ultimately we cannot, in software, behave well in a scenario where Logstash consumes at a lower rate than the data is produced and the produced data is deleted quickly. In those scenarios there will always be either data loss or undesirable behavior (deleted files still occupying disk space because Logstash holds them open with the intent to read them fully).
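The "deleted files still occupying disk space" point is ordinary POSIX unlink semantics, easy to demonstrate (a minimal sketch; the temp file stands in for a rotated log):

```python
import os
import tempfile

# Create a small "log file".
fd, path = tempfile.mkstemp()
os.write(fd, b"held open\n")
os.close(fd)

r = open(path)      # a reader (think: Logstash) still has the file open ...
os.unlink(path)     # ... when "logrotate" deletes it
data = r.read()     # the data -- and its disk blocks -- are still there
print(data)
r.close()           # only after the last close can the kernel free the space
```

Until that final `close()`, the file name is gone from the directory but the space is not reclaimed.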

@jordansissel

Any news/progress on this one?

@LionelCons I need more details if you want a good answer; I don't know what progress you are asking about. If you believe you are encountering this issue, providing details on what you observe, along with your configuration(s), would help identify whether your issue is the same as the one reported in this ticket.

@jordansissel

jordansissel commented Jun 29, 2017

note that Filebeat tested without data loss

Filebeat probably doesn't lose data here because it has file discovery ("prospecting") executing separately (in a goroutine) from file reading ("harvesting"). This allows it (as I said in my comment above) to find new files while still processing existing known files.

It would still have the problem I described: a prospected-but-not-yet-harvested file will still occupy disk space after logrotate has deleted it, until Filebeat finishes reading it and closes it as idle.
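The prospector/harvester split can be sketched like this (hypothetical Python, standing in for Filebeat's Go internals): because discovery runs on its own thread, a file that appears mid-harvest is still found.

```python
import queue
import threading
import time

def prospector(paths, out_q, stop):
    """Discovery loop: scans `paths` (standing in for a glob) on its own thread."""
    seen = set()
    while not stop.is_set():
        for p in list(paths):
            if p not in seen:
                seen.add(p)
                out_q.put(p)          # hand newly found files to the harvester
        time.sleep(0.01)

def harvester(in_q, results, stop):
    """Reading loop: consumes files as they are discovered."""
    while not stop.is_set():
        try:
            results.append(in_q.get(timeout=0.01))   # stand-in for "read the file"
        except queue.Empty:
            pass

paths, found_q, results = ["a.log"], queue.Queue(), []
stop = threading.Event()
threads = [threading.Thread(target=prospector, args=(paths, found_q, stop)),
           threading.Thread(target=harvester, args=(found_q, results, stop))]
for t in threads:
    t.start()
time.sleep(0.05)
paths.append("a.log.1")   # "rotation" creates a new file while harvesting continues
time.sleep(0.05)
stop.set()
for t in threads:
    t.join()
print(sorted(results))    # the file created mid-run was still discovered
```

Contrast this with a single loop that only discovers files after reaching EOF on the current one: there, discovery frequency is coupled to read speed, which is exactly the stall described earlier in this thread.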

@LionelCons

It seems that my problem (incomplete lines read from Kafka logs) most likely comes from inodes being re-used.

This seems to be a known issue: jordansissel/ruby-filewatch#79.
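The inode-reuse failure mode can be sketched as follows (hypothetical; a real sincedb records more than this, but the principle is the same): if reading progress is keyed by inode, a brand-new file that happens to reuse a deleted file's inode is mistaken for the old file, and reading resumes mid-file.

```python
# Hypothetical sketch: reading position ("sincedb") keyed by inode alone.
def resume_offset(sincedb, inode, size):
    offset = sincedb.get(inode, 0)
    if offset > size:   # file smaller than our offset: detectable as truncation
        return 0
    return offset       # same inode and a plausible offset: bytes are skipped

sincedb = {1452461: 300}   # old file was read up to byte 300, then deleted
# A brand-new 500-byte file happens to reuse inode 1452461:
print(resume_offset(sincedb, 1452461, 500))   # -> 300: its first 300 bytes are never read
```

Note the asymmetry: a reused inode with a *smaller* file is caught by the truncation check, but a reused inode with a *larger* file silently loses its head, which would also produce partial lines at the resume point.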

@jordansissel: Could you comment on this PR from your ruby-filewatch?
