"Do something" when the file is done being read #52

Closed
talevy opened this issue Jun 16, 2015 · 25 comments

@talevy
Contributor

talevy commented Jun 16, 2015

Either rename the file when done processing, or add ability to specify a path for them to be moved to.

(possible options)

@jsvd
Member

jsvd commented Jun 16, 2015

What are the criteria for considering a file fully processed? Isn't it normal to be hitting EOF constantly while the file is being appended to?

@jordansissel
Contributor

"done processing" currently isn't in the vocabulary of the file input. Files, today, are treated as data streams that live forever.

We'll have to figure out what "done" really means. I also don't necessarily want to turn Logstash into a log rotator, since that's less about "inputting files" and more about "managing log file lifecycles" - I'm open to discussion, though. :)

@jordansissel
Contributor

I will say, though, that many users request this kind of feature. I think there may be similar tickets elsewhere asking for things like:

  • "logstash should exit when it finishes processing files"
  • "logstash should delete files when done with them"
  • "logstash should close files when done with them"
  • "logstash should {some task} when done reading files"

We currently have no definition of "done" for an infinite stream. I wonder if the behavioral differences mean we'll need a new plugin to handle "read this file once and delete it when done"-style tasks.

@dev-head

dev-head commented Jul 2, 2015

Perhaps "done" could be a set of configurable options that let users specify when something is considered done.

Using the config below, the file is "done" when the last event received from the watched file is older than ten minutes; once it is "done", execute /my_script.sh, perhaps passing the file path to the script automatically. This could open up some interesting new use cases.

done_when: 'last_event'
done_by: '10 minutes'
done_do: '/my_script.sh'
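
A sketch of how these hypothetical options might sit inside a file input block (the option names are this proposal's, not an existing API):

input {
  file {
    path      => "/var/log/app/*.log"
    # hypothetical options from the proposal above; not part of the real plugin
    done_when => "last_event"      # which condition marks a file "done"
    done_by   => "10 minutes"      # threshold for that condition
    done_do   => "/my_script.sh"   # script to run, receiving the file path
  }
}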

@jordansissel jordansissel changed the title from 'Mark files as read when done processing' to '"Do something" when the file is done being read' on Jul 7, 2015
@yehosef

yehosef commented Jul 7, 2015

Instead of looking at when the events are received, you could define a "done" timeframe for a log file and just look at the modified timestamp. Say I'm starting to process a directory with lots of logs: if a log file hasn't been modified for 10 days, once I hit EOF I can assume I'm done - I shouldn't have to wait.

There may also need to be other rules. E.g., if logs are being rotated, once I finish an old log I could mark it as done after 20 minutes of inactivity, because there is another file receiving the current writes. But if there is only one file, then even after 20 minutes of inactivity I may not want to mark it as "done" and trigger the done action on it.
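
A sketch of the modification-age idea as a file input option. For what it's worth, ignore_older later became a real option in logstash-input-file along roughly these lines, though it skips stale files rather than marking them "done", so treat the exact semantics here as an assumption:

input {
  file {
    path         => "/var/log/archive/*.log"
    ignore_older => 864000   # seconds; files not modified in the last 10 days are skipped
  }
}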

@morallo

morallo commented Jul 16, 2015

In my case, I want to process a lot of one-line JSON reports that are totally static, and just have logstash monitor a folder to catch new files.

What about an option "static" => true to specify that the files are created once and won't change over time? When this option is enabled, processing can be considered done once EOF is reached.

@dev-head

@morallo

File size might be a concern, though: on a large file, logstash might reach EOF before the write has really finished.

@MarkusMayer

@morallo @dev-head we have a similar situation here: we're watching a relatively large directory structure where our app is continuously creating files (500k/day). Each file is written only once and is between 2k and ~1M in size. At the moment logstash isn't stable in that scenario (at least in our setup under Windows), which I assume is because it has to keep watching all the files at the same time...
If you could, for instance, have something like @morallo's static parameter, and logstash would only start ingesting files older than X seconds (or with a modified timestamp older than X seconds), this would be really cool for our use case.

@morallo

morallo commented Jul 18, 2015

What about static and timeout parameters? Consider a file done when you reach EOF, the file size hasn't changed, and no new events have arrived for timeout seconds.
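
Putting the two hypothetical parameters together (neither exists in the plugin; this is a sketch of the proposal only):

input {
  file {
    path    => "/reports/*.json"
    static  => true   # hypothetical: the file never changes after creation
    timeout => 30     # hypothetical: seconds of inactivity before EOF counts as "done"
  }
}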

@jsvd
Member

jsvd commented Sep 8, 2015

@MarkusMayer what I did at my old job was have logrotate rename old files to .old after 1 day, so I set up the file input as:

file { path => "/srv/data/*.log" }

And then I had files like /srv/data/20150905.log, which were being monitored by logstash, and rotated ones like /srv/data/20150901.log.old, which were not.

By removing the file from logstash's watch, the resources associated with it will be freed; only the inode record will persist in sincedb.
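
The same effect can be made explicit with the file input's exclude option (which does exist); a sketch assuming the same .old rotation scheme:

input {
  file {
    path    => "/srv/data/*.log*"
    exclude => "*.old"   # rotated files drop out of the watch and their fds are closed
  }
}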

@magnusbaeck
Contributor

By removing the file from logstash's watch, the resources associated with it will be freed; only the inode record will persist in sincedb.

Won't filewatch in fact either

  • notice that /srv/data/20150905.log no longer exists and delete the sincedb entry, or
  • notice that the inode number of /srv/data/20150905.log doesn't match the one in sincedb and delete and recreate the entry with the current inode number?

@jsvd
Member

jsvd commented Sep 8, 2015

Delete should close the fd and remove it from some of the data structures, but not the sincedb: https://github.com/jordansissel/ruby-filewatch/blob/master/lib/filewatch/tail.rb#L95-L102

I don't understand the second hypothesis. To better fit @MarkusMayer's scenario, /srv/data would have tons of /srv/data/20150905.#{random_id}.log files (multiple files a day) and rotated /srv/data/20150905.#{random_id}.log.old files that don't match file { path => "/srv/data/*.log" }.

@magnusbaeck
Contributor

Ah, you're right! I only read the producing end in watch.rb and didn't study what actually happens when a delete request is received.

The second bullet applies in the scenario you described (except that, again, the original sincedb entry won't be deleted), not in Markus's.

@MarkusMayer

@jsvd @magnusbaeck thanks for your feedback and your idea, jsvd. When I came across my scenario, ingesting the files continuously with the file plugin did not work (it just kept stopping after some time and refused to ingest new files). However, at that time I used 1.5.rc2 (I filed an issue, elastic/logstash#2882, which apparently got solved). To my shame I have to admit that I never got around to retesting it with a current release. After reading a bit on how the file plugin works, I figured my scenario wasn't what the plugin was intended for, and followed a completely different path. We're still using the file input for our other "regular" log files, though.

@syepes

syepes commented Oct 5, 2015

Absolutely, one of those key missing features.

@jamesblackburn

Exiting when 'done' is also useful for end-to-end testing of a large logstash config. It would be nice to start it up on a directory of canned data and assert the output is as we expect.

@magnusbaeck
Contributor

Exiting when 'done' is also useful for end-to-end testing of a large logstash config. It would be nice to start it up on a directory of canned data and assert the output is as we expect.

True, but wouldn't the stdin input be pretty useful for that already?

(Within a week or two I hope to open source a tool to assist with exactly that, feeding Logstash canned data and asserting that we get the expected results.)

@jamesblackburn

I've used stdin for now, there are a few issues:

  • Have to frig the type or tags differently for different inputs (for logstash config that expects them)
  • Have to fix config that expects a particular %{path}

I've written a few lines of Python to drive logstash over a directory of logs and assert the json output is as we would expect:
https://gist.github.com/jamesblackburn/2e895f8b843011709094

@psaiz

psaiz commented Nov 10, 2015

I have the same issue described here. I used to have logstash get the data from a file and logrotate handle the renaming/removal. Then I ran into trouble: if, for whatever reason, logstash died or got too slow, logrotate would continue to happily rotate the input files, and that turned into events being lost.

It would be great to have something like what @dev-head suggested, so that when logstash finishes consuming a file, it could trigger an action.

@mlsquires

My use case is the same as other people have described - for logstash configuration testing I want to process one or more specific static files - I know ahead of time that they won't be open-ended streams.

@magnusbaeck
Contributor

I wonder, instead of making the file input capable of doing a million things upon hitting EOF, should it instead emit a separate event of a particular type? Then we could use the existing arsenal of Logstash plugins to act upon that event and e.g. delete the file (a sketch follows at the end of this comment). In fact, we could emit events for non-EOF progress too, allowing progress feedback without having to monitor the sincedb files and correlate them to files via inode numbers. This kind of out-of-band metaevent could probably apply to other plugins too.

(Within a week or two I hope to open source a tool to assist with exactly that, feeding Logstash canned data and asserting that we get the expected results.)

This tool is now available here: https://github.com/magnusbaeck/logstash-filter-verifier
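
A sketch of how such a metaevent might be consumed downstream, assuming the file input one day tags an end-of-file event; the tag name and the use of the community logstash-output-exec plugin are illustrative assumptions:

output {
  if "file_eof" in [tags] {               # hypothetical tag emitted at EOF
    # naive illustration only: interpolating paths into shell commands is unsafe
    exec { command => "rm -f %{path}" }
  }
}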

@guyboertje
Contributor

I think this feature belongs to the Batch File Processing requirement.

@yehosef

yehosef commented Jan 25, 2016

I just commented on that ticket - #48 (comment) - where I explain why (IMO) this should not be in a different plugin.

I think the batch use case is a simple "when you hit EOF, do something". We don't need time limits or anything fancier - it could be as simple as an "eof_script" option that points to a bash script to run (passing in the filename) when the EOF of the file is hit; a sketch follows at the end of this comment. You could also cover other common simple use cases: delete the file, emit an event, etc. The script is a catch-all that could cover any other need and would be easy to implement (it seems).

The advantage of keeping it in the file plugin is that you don't have to duplicate logic (file types, multiple lines, keeping track of the read location - even for a single file you need to keep the current pointer in case logstash dies...).

I haven't considered using logstash in a long time because I wrote my own script for handling batch files. I recently had a need that was a little outside my script's use case, so I thought I'd look at whether Logstash 2 had fixed this.
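
A minimal sketch of the eof_script idea above, with the option name and calling convention as assumptions:

input {
  file {
    path       => "/batch/*.csv"
    eof_script => "/usr/local/bin/archive.sh"   # hypothetical: run at EOF with the file path as $1
  }
}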

@rodgermoore

jordansissel commented on Jun 16, 2015

I will say, though, that many users request this kind of feature, I think there may be similar tickets elsewhere asking for things like:

  • "logstash should exit when it finishes processing files"
  • "logstash should delete files when done with them"
  • "logstash should close files when done with them"
  • "logstash should {some task} when done reading files"

+1 for this!

Desperately waiting for this kind of functionality. For me "logstash should exit when it finishes processing files" is most valuable.

Don't get me wrong, without these features Logstash still is a kick-log tool 😄

@suyograo
Contributor

suyograo commented Apr 26, 2016

This will be implemented as part of #48 as a new plugin
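
For what such read-once behavior can look like, here is a sketch using option names from later versions of logstash-input-file; they postdate this thread, so treat the exact names as an assumption:

input {
  file {
    path                  => "/batch/*.csv"
    mode                  => "read"     # read each file once to EOF instead of tailing it
    file_completed_action => "delete"   # or "log" / "log_and_delete"
  }
}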
