"Do something" when the file is done being read #52
What are the criteria for considering a file to have been fully processed? Isn't it normal to be hitting EOF constantly as the file is being appended to?
"done processing" currently isn't in the vocabulary of the file input. Files, today, are treated as data streams that live forever. We'll have to figure out what "done" really means. I also don't necessarily want to turn Logstash into a log rotator, since that's less about "inputting files" and more about "managing log file lifecycles" - I'm open to discussion, though. :)
I will say, though, that many users request this kind of feature; I think there may be similar tickets elsewhere asking for things like:
We currently have no definition for "done" on an infinite stream. I wonder if the behavioral differences will mean we need a new plugin to handle "read this file once and delete it when done"-style things.
Perhaps "done" can be a set of configurable options that allow users to specify when something is considered done. Using the config below, the file is "done" when the last event received in the watched file is older than ten minutes, and when it is "done", /my_script.sh is executed:

```
done_when: 'last_event'
done_by: '10 minutes'
done_do: '/my_script.sh'
```
Instead of looking at when the events are received, you can define a "done" timeframe for a log file and just look at the modified timestamp. Meaning, let's say I'm starting to process a dir with lots of logs: if I have a log file that wasn't modified for 10 days, once I hit the EOF, I can assume I'm done - I shouldn't have to wait. There may also need to be other rules. E.g., if I have logs being rotated, once I finish an old log I could mark it as done after 20 minutes of inactivity, because there is another file receiving the current writes. But if there is only one file, then even after 20 minutes of inactivity, I may not want to mark it as "done" and process it.
In my case, I want to process a lot of 1-line JSON reports that are totally static, and just have Logstash monitor a folder to catch new files. What about an option
File size might be a concern, though: on a large file, Logstash might get to EOF before the write really finishes.
@morallo @dev-head we have a similar thing here: watching a relatively large directory structure where our app is continuously creating files (500k/day). Each file is written only once and is between 2k and ~1M in size. At the moment logstash isn't working stably in that scenario (at least in our setup under Windows), which is (I assume) due to the fact that it has to keep watching all the files at the same time...
What about
@MarkusMayer what I did at my old job was having logrotate rename old files to .old after 1 day, so I set up file input as:
And then had files like that. By removing the file from logstash's watch, the resources associated with it will be freed; only the inode record will persist in sincedb.
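The elided file-input setup might look something like the sketch below. The paths and glob pattern are illustrative; `exclude` is a real option of the Logstash file input, matched against filenames:

```
input {
  file {
    path => "/var/log/app/*.log"
    # Skip files logrotate has renamed; filewatch stops tracking them.
    exclude => "*.old"
  }
}
```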
Won't filewatch in fact either
delete should close the fd and remove it from some of the data structures, but not the sincedb: https://github.com/jordansissel/ruby-filewatch/blob/master/lib/filewatch/tail.rb#L95-L102. I don't understand the second hypothesis. To better fit @MarkusMayer's scenario:
Ah, you're right! I only read the producing end in watch.rb and didn't study what actually happens when a delete request is received. The second bullet applies in the scenario you described (except that, again, the original sincedb entry won't be deleted), not in Markus's.
@jsvd @magnusbaeck thanks for your feedback and your idea, jsvd. When I came across my scenario, ingesting the files continuously with the file plugin did not work (it just kept stopping after some time and refused to ingest new files). However, at that time I used 1.5.rc2 (I filed an issue, elastic/logstash#2882, which apparently got solved). To my shame I have to admit that I never got around to retesting it with a current release. After reading a bit on how the file plugin works, I just thought that my scenario isn't what the thing was intended for, and followed a completely different path. Still using file input for our other "regular" log files, though.
Absolutely, one of those key missing features. |
Exiting when 'done' is also useful for end-to-end testing of a large logstash config. It would be nice to start it up on a directory of canned data and assert the output is as we expect. |
True, but wouldn't the stdin input be pretty useful for that already? (Within a week or two I hope to open source a tool to assist with exactly that, feeding Logstash canned data and asserting that we get the expected results.) |
I've used stdin for now; there are a few issues:
I've written a few lines of Python to drive logstash over a directory of logs and assert the json output is as we would expect: |
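The script itself isn't shown above. A minimal sketch of the assertion half of such a harness might look like this; the function name and the ignored-field list are assumptions, and actually driving Logstash is left to a `subprocess` call over the config and log directory:

```python
import json

def events_match(actual_lines, expected_events, ignore=("@timestamp", "host")):
    """Compare Logstash's JSON-lines output against expected events,
    ignoring fields that vary between runs."""
    actual = []
    for line in actual_lines:
        event = json.loads(line)
        for field in ignore:
            event.pop(field, None)
        actual.append(event)
    return actual == expected_events
```

The driving half would run something like `bin/logstash -f test.conf < input.log`, capture stdout, and feed the lines to this check.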
I have the same issue described here. I used to have logstash get the data from a file, and logrotate handle the renaming/removal. Then I ran into trouble if, for whatever reason, logstash died or got too slow: logrotate would continue to happily rotate the input files, and that turned into events being lost. It would be great to have something like what @dev-head suggested, so that when logstash finishes consuming a file, we could trigger an action.
My use case is the same as other people have described - for logstash configuration testing I want to process one or more specific static files - I know ahead of time that they won't be open-ended streams. |
I wonder, instead of making the file output capable of doing a million things upon hitting EOF, should it instead emit a separate event of a particular type? Then we could use the existing arsenal of Logstash plugins to act upon that event and e.g. delete the file. In fact, we could emit events for non-EOF progress too, allowing progress feedback without having to monitor the sincedb files and correlate them to files via inode numbers. This kind of out of band metaevents could probably apply to other plugins too.
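No such metaevent exists today; if it did, acting on it with existing plugins might look something like this sketch (the `file_eof` tag and the event shape are purely hypothetical, while the `ruby` filter is a real plugin):

```
filter {
  # Hypothetical: the file input emits a tagged metaevent at EOF.
  if "file_eof" in [tags] {
    ruby {
      code => "File.delete(event['path'])"
    }
  }
}
```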
This tool is now available here: https://github.com/magnusbaeck/logstash-filter-verifier |
I think this feature belongs to the Batch File Processing requirement. |
I just commented on that ticket - #48 (comment) I think the batch use case is a simple "when you hit the EOF, do something". We don't need time limits or anything fancier; it can be as simple as an "eof_script" option which points to a bash script to run (passing in the filename) when the EOF of the file is hit. You could also cover other common simple use cases: delete the file, emit an event, etc. The script is a catch-all that could cover any other need and would be easy to implement (it seems). The advantage of keeping it in the file plugin is that you don't have to duplicate logic (file types, multiple lines, keeping track of the read location - even for a single file you need to keep the current pointer in case logstash dies). I haven't considered using Logstash in a long time because I wrote my own script for handling batch files. I had a need that was a little outside of my use case, so I thought I'd see if Logstash 2 had fixed this.
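The `eof_script` option proposed above does not exist in the file input; a sketch of how such a configuration might look, with the option name and paths purely hypothetical:

```
input {
  file {
    path => "/data/batch/*.json"
    # Hypothetical option: run this script, passing the filename,
    # once the end of the file is reached.
    eof_script => "/usr/local/bin/archive-done.sh"
  }
}
```

The script itself could then do anything - move the file to a done directory, delete it, or notify another system.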
+1 for this! Desperately waiting for this kind of functionality. For me "logstash should exit when it finishes processing files" is most valuable. Don't get me wrong, without these features Logstash still is a kick-log tool 😄 |
This will be implemented as part of #48 as a new plugin |
Either rename the file when done processing, or add the ability to specify a path for the files to be moved to.
(possible options)