Notion of State / Variable Expansion in Config #51


Open
PhaedrusTheGreek opened this issue Apr 14, 2016 · 33 comments


@PhaedrusTheGreek

As explained in Event Dependent Configuration

Some of the configuration options in Logstash require the existence of fields in order to function. Because inputs generate events, there are no fields to evaluate within the input block—they do not exist yet!

Which is fair, but some plugins such as the http_poller run on intervals, and could theoretically access some state.

The end goal would be to say something like this:

With use of some globally maintained variable (not even sure the best way to do this)

url => "http://my-server.com/stats?time=%{now}"

Or by maintaining an environment variable outside of Logstash

url => "http://my-server.com/stats?time=${now}"
@webmstr

webmstr commented Aug 19, 2016

Yup. Back to external scripts, I guess.

@jordansissel
Contributor

jordansissel commented Aug 19, 2016

I propose the following -- using the time formatting syntax we already support with events, we could make a special handler within this plugin that allows you to say %{+<time format>} to include that time format in your URL.

An example for "now" that uses the unix epoch (%s in time format):

url => "http://my-server.com/stats?time=%{+%s}"

Or with today's date only:

url => "http://my-server.com/stats?time=%{+yyyy-MM-dd}"

All times would be "now", i.e. the moment the HTTP request was initiated.

@jordansissel
Contributor

@PhaedrusTheGreek @hummingV Thoughts on my proposal above? I worry it may confuse users since it uses the same syntax as event formatting, but at least it's possibly a familiar format?

I'm open to other proposals.

@PhaedrusTheGreek
Author

@jordansissel is there any way to have global variables that could be set by a ruby filter? I know, I'm a dreamer - but, addressing only the time issue in a single plugin might result in unwanted technical debt.

@hummingV
Contributor

hummingV commented Aug 21, 2016

@PhaedrusTheGreek , @jordansissel
Having a ruby filter set variables is a utopia, but I think that would result in a major redesign.

If the ruby filter were to sit upstream, there are no events yet to process, since input plugins are the ones responsible for creating events and pushing them down the queue. If it were to sit downstream, it could set global variables for the next cycle of events, but that's not a very elegant solution either. All in all, without a redesign of the overall Logstash event flow, forcing a ruby filter into this specific scenario would incur technical debt of its own.

My proposal is that this decision should be based on demand. Is there sufficient demand for filter plugins to be able to run upstream of input plugins? Or can we get away with minimal changes for this specific requirement?
It's better to be iterative and not to invest too heavily upfront. @jordansissel's proposal is exactly the same as the sprintf format, so the same syntax is forward-compatible even if we decide to add ruby filters later on. i.e. for now these are plugin-specific; if ruby filters become a reality later, the same syntax can be supported without breaking the configs.

The only limitation I see is that sprintf only supports the current time. What if we want to use other time objects instead? Can we extend the format to be the following:
%{varname+%s}, %{varname+yyyy-MM-dd}
Here varname is a plugin-specific declaration. Each plugin can make some set of variables available for injection.
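A rough sketch of how such expansion could work, using Ruby strftime patterns rather than the Joda-style yyyy-MM-dd above; the `expand` helper and the variable map are illustrative, not plugin API:

```ruby
# Expand proposed %{varname+format} tokens against a plugin-supplied
# map of variable names to Time objects (illustrative sketch only).
def expand(template, vars)
  template.gsub(/%\{(\w+)\+([^}]+)\}/) do
    vars.fetch(Regexp.last_match(1)).strftime(Regexp.last_match(2))
  end
end

expand('time=%{now+%Y-%m-%d}', 'now' => Time.utc(2016, 8, 21))
# => "time=2016-08-21"
```

Each plugin would populate the vars map with whatever state it chooses to expose (e.g. 'now', 'scheduledTime').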

@PhaedrusTheGreek
Author

@hummingV I see your point - trying to control input state by filters makes no sense.

+1 for some standard support beyond date. Has inline scripting already been ruled out?

@hummingV
Contributor

I am open to the idea of pre- and post-event hooks. That is the most flexible solution imo. Having them only for this plugin might be a bit awkward, though. Not sure if the logstash team would want an implementation at the base class level.

@jordansissel
Contributor

For inputs, post-event things can be done in filters.

As for "pre-event" hooks, I don't think that solves this problem. The issue described in this ticket is for doing some computation to generate the URL used in the http request before each poll -- this is not necessarily "before the event" but is more "immediately before the next http request". It is possible that the http_poller could produce multiple individual events from a single http request.

As for solving this for more than just dates, I don't have any solutions available yet.

I believe we can solve the "now" and time formatting concern without needing a general solution. There's no mechanism to provide mutable state to an input for any plugin today. Some plugins will ask for the current time or may ask for a random number, but neither of those things are user-facing configurations.

Do you think we could solve the "now" problem without needing a scripting/general solution?

@PhaedrusTheGreek
Author

Formattable %{now} does solve this particular problem for me.

@webmstr

webmstr commented Aug 22, 2016

My use case: I'm trying to hit a URI from a remote provider. The base URL is itself static. It takes some headers (supported), but requires some dynamic time indicators in the query string (start_time, end_time). I would like to send in values like "now-5min" and "now".

Based on the other comments, I would like portions of the timestamp to be constant. In my case, I'll probably run every minute, so I would want seconds zeroed out (12:15:00). Depending on whether the remote provider handles dates as inclusive or exclusive, you might also need ":59", e.g. 12:15:59.

Having a mechanism to track the last time successfully processed would also be useful if it could be used as the start time for the next request (similar to the timestamp tracked by the S3 input).
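A minimal sketch of that seconds-zeroing window in Ruby; `polling_window` and its 5-minute default are illustrative names, not existing plugin options:

```ruby
# Build a [start, end] polling window ending at the current minute,
# with seconds zeroed out, as described above (sketch only).
def polling_window(now, minutes: 5)
  end_time = now - now.sec           # truncate to the whole minute
  start_time = end_time - minutes * 60
  [start_time, end_time]
end

start_t, end_t = polling_window(Time.utc(2016, 8, 22, 12, 15, 42))
# end_t is 12:15:00, start_t is 12:10:00
```

An inclusive-end variant would simply use `end_time - 1` (…:59) as the upper bound instead.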

@jordansissel
Contributor

Let's try this another way: instead of describing the solution ("I need current time" or "I need now-5min"), can y'all tell a story about what you want to do? Maybe we can find a solution from the information in these stories.

@hummingV
Contributor

hummingV commented Aug 22, 2016

My story:
I am using http-poller to send queries to an Elasticsearch index (say: source). In source, I have individual timestamped http requests. At the end of each day, I query aggregated statistics from source and stash them in another Elasticsearch index (say: dest). dest is an input to our monitoring tool.

I need scheduledTime variable expansion to specify the date range in the query to source. I want this range to be precise intervals (e.g. 2016-08-20T00:00:00 to 2016-08-21T00:00:00). Btw, intervals are not necessarily 1 day each; I also want other intervals such as 15 minutes, 1 hour, etc. That is why I want to expand 'scheduledTime' instead of 'current system time'. The current time could be affected by latencies in java threads. Don't get me wrong, I can afford the query being sent a bit late, but the timestamp sent in the query should be the intended scheduledTime instead of whatever time happens to be right now.
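The aligned-interval idea above can be sketched by flooring the scheduled time to the interval boundary; `scheduled_window` is an illustrative name, not plugin API, and it assumes UTC-aligned intervals:

```ruby
require 'time'

# Derive a precise [from, to] query window from the scheduled slot
# rather than from "whatever time it is now" (sketch only).
# interval_seconds: 86_400 = daily, 900 = 15 minutes, etc.
def scheduled_window(scheduled_time, interval_seconds)
  slot_end = Time.at((scheduled_time.to_i / interval_seconds) * interval_seconds).utc
  [slot_end - interval_seconds, slot_end]
end

# Even if the poll fires 37 seconds late, the window stays exact:
from, to = scheduled_window(Time.utc(2016, 8, 21, 0, 0, 37), 86_400)
# from => 2016-08-20T00:00:00Z, to => 2016-08-21T00:00:00Z
```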

@PhaedrusTheGreek
Author

Looking at my requirement again, I actually do need to be able to do date math as well.

The use case is fetching job stats from YARN:

url => "http://host:port/path?startedTimeBegin=%{now-60}&finishedTimeEnd=%{now}&states=FINISHED"

@hummingV
Contributor

hummingV commented Aug 22, 2016

I should also mention that I have similar use cases for other types of sources, such as SQL (jdbc plugin) and SNMP.

@nothau

nothau commented Nov 11, 2016

+1 for this too

@jordansissel
Contributor

jordansissel commented Nov 11, 2016

The request sounds fairly complex, so let me try and restate what I am hearing. Tell me if this isn't right:

  • A way to specify that part of a given setting should come from a dynamic, external value (time, etc)
  • A way to do computation (do math on time)
  • A way to format these dynamic, computed values (time formatting)
  • Be able to do this on different inputs (or outputs, or filters?)

If this is accurate, then I am in favor of such a thing, but I'm not sure exactly how to solve it just now. It will take some effort to come up with a solution that fits well.

@CraigFoote

+1

@z-vr

z-vr commented Dec 2, 2016

This would be an amazing feature. We get some max_id in the response from a web server; it would be very useful to store it in a file so that the next request can use it as the min_id parameter (Papertrail), in analogy with https://www.elastic.co/guide/en/logstash/current/plugins-inputs-jdbc.html#_predefined_parameters
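A sketch of that file-backed state, analogous to the jdbc plugin's sql_last_value; the state-file path and the max_id/min_id names are assumptions, not real http_poller options:

```ruby
require 'json'
require 'tmpdir'

# Illustrative state file shared across polls (not a real plugin option).
STATE_FILE = File.join(Dir.tmpdir, 'poller_state.json')

# Read the max_id persisted by the previous poll; 0 on first run.
def load_min_id
  return 0 unless File.exist?(STATE_FILE)
  JSON.parse(File.read(STATE_FILE)).fetch('max_id', 0)
end

# Persist the max_id seen in the current response for the next poll.
def save_max_id(max_id)
  File.write(STATE_FILE, JSON.generate('max_id' => max_id))
end

save_max_id(1042)
load_min_id  # => 1042 on the next poll
```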

@jordansissel
Contributor

jordansissel commented Apr 5, 2017

Copying my response from https://discuss.elastic.co/t/response-headers-http-poller/80739/3 here:


Having the http_poller input be aware of how to paginate is something we've discussed internally. I personally feel it's not something we can achieve because of the ways that things present pagination, such as:

  • Fetch the main page, get a token that you can query repeatedly in the future to get the next page of results (the Elasticsearch Scroll API does this)
  • Fetch the first page, and the response includes an identifier for the next page (the GitHub API does this, as does your Microsoft service, from your description)
  • Fetch pages by number (1, 2, 3, 4) and stop fetching when the current page has no results

There are probably more pagination strategies, and I'm not sure we can prepare the http_poller plugin to support them all.

In cases such as this, I generally recommend that a new input plugin be created that has specific knowledge of how to read data from a specific data source. In this case, it would be my recommendation to have a custom input plugin that knows how to handle the pagination strategy deployed by the Microsoft Table Service.
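For illustration, the page-number strategy above is easy to sketch in a custom input plugin; `fetch_page` stands in for the real HTTP call:

```ruby
# Drain a page-numbered API: fetch pages 1, 2, 3, ... until an empty
# page signals the end (sketch of the third strategy above).
def poll_all(fetch_page)
  events = []
  page = 1
  loop do
    batch = fetch_page.call(page)
    break if batch.empty?
    events.concat(batch)
    page += 1
  end
  events
end

pages = { 1 => ['a', 'b'], 2 => ['c'] }
poll_all(->(n) { pages.fetch(n, []) })  # => ["a", "b", "c"]
```

The token- and next-page-identifier strategies need per-source knowledge of the response format, which is why a dedicated input plugin fits them better.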

@txiaoyi

txiaoyi commented Feb 7, 2018

Similar use case with a url:

url => "${URL}&creationdate=gt$(date --date='3 month ago' +%Y-%m-%d)"

i.e. being able to calculate the expanded value at runtime, not just read a static env var.
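Since ${VAR} expansion only reads static environment variables, one workaround is a wrapper that computes the value, sets the variable, and then launches Logstash. A sketch; the CREATION_DATE name and the exec line are illustrative:

```ruby
require 'date'

# Compute "3 months ago" outside Logstash and export it, so the config
# can use "${URL}&creationdate=gt${CREATION_DATE}" (workaround sketch).
cutoff = (Date.today << 3).strftime('%Y-%m-%d')  # << 3 subtracts 3 months
ENV['CREATION_DATE'] = cutoff
# exec('bin/logstash', '-f', 'poller.conf')      # illustrative launch
```

The obvious downside is that the value is fixed for the life of the Logstash process, so this only helps when restarts are acceptable.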

@daqqad

daqqad commented Jun 14, 2018

+1 for this. My use case is pulling logs via API calls that require start and end timestamps.

@chrisribe

If only the poller could work like the JDBC input plugin and have a concept of state.
https://www.elastic.co/guide/en/logstash/6.5/plugins-inputs-jdbc.html
sql_last_value provides a mechanism for dealing with the start date and the last date in the result.
https://www.elastic.co/guide/en/logstash/6.5/plugins-inputs-jdbc.html#_predefined_parameters

This would make indexing large datasets via different APIs much easier.

@daqqad

daqqad commented Dec 13, 2018

> If only the poller could work like the JDBC input plugin and have a concept of state.
> https://www.elastic.co/guide/en/logstash/6.5/plugins-inputs-jdbc.html
> sql_last_value provides a mechanism for dealing with the start date and the last date in the result.
> https://www.elastic.co/guide/en/logstash/6.5/plugins-inputs-jdbc.html#_predefined_parameters
>
> This would make indexing large datasets via different APIs much easier.

That wouldn't work as well, because APIs might use all kinds of different ways to track state while SQL only uses a standard date format or integer IDs, but it would be considerably better than nothing.

@novaksam

I wrote my own ruby script for this situation (in my case, pulling Duo events), but if we could get support in the http_poller input, that would make polling time-based APIs a lot simpler. It might just be a matter of adding a parameter for 'seconds since' or some such thing (I used a hard-coded 900-second buffer in my script).

require_relative 'duo_api'
require 'json'
require 'date'

def register(params)
  @ikey = params["IKEY"]
  @skey = params["SKEY"]
  @host = params["HOST"]
end

def filter(event)
  client = DuoApi.new(@ikey, @skey, @host)
  # Get the current time
  currenttime = DateTime.now
  # Subtract 900 seconds from the 'now' time to allow for process
  # time gaps while not creating excessive duplication.
  # Rational represents a fraction; 86400 is the number of seconds in a day.
  oldesttime = currenttime - Rational(900, 86400)
  # Get the authentication logs.
  # strftime('%Q') returns the value in milliseconds since the unix epoch.
  resp = client.request 'GET', '/admin/v2/logs/authentication',
                        { mintime: oldesttime.strftime('%Q'),
                          maxtime: currenttime.strftime('%Q'),
                          limit: '1000' }
  resp_json = JSON.parse(resp.body)
  if resp_json['response']['metadata']['total_objects'] != 0
    event.set("duo", resp_json['response'])
    [event]
  else
    []
  end
end

@novaksam

Wonder if they'll let this through: #111

@hiven

hiven commented Jun 4, 2019

Would really like this. I get data from an API and it needs relative times; I can't figure out a way around it.

@veruyandi

In my similar case, I have to extract Dynatrace dashboard data. I'm using the Dynatrace AppMon Server REST interface to generate XML reports per minute.

I'm able to parse the response using the Xml filter plugin and format it with a custom Ruby script according to my needs; however, I also need a solution inside Logstash to get the source data periodically with an http request, and I don't want to depend on a custom shell script etc. running outside of Logstash.

I can't use the Last1Min filter because it is out of my control; it depends on the triggering second and can cause other issues.

An example url for 16:50 to 16:51 UTC+3 is the following:

url => https://:8021/rest/management/reports/create/?filter=tf:CustomTimeframe?1572875400000:1572875460000&type=XML

@AnkurThakur

For my use case, I was looking for this. But unfortunately, it isn't supported, so I created a dirty workaround.

For anyone who would like an alternative: https://stackoverflow.com/a/61259006/3565756

It's not specific to HTTP Poller, but it is relevant to polling.

@vytsci

vytsci commented Apr 12, 2021

Not having this leaves us developing bridge scripts that forward logs into Logstash. We are very frustrated: we had many push-based logs and a few that required polling. We went with push-based collection because it was easier, and had not considered that polling would have no state awareness at all. Now we end up creating a hack instead of choosing Apache NiFi or a similar tool. Without this feature, collecting more advanced logs is effectively impossible.

@HrvojeFER

Any update on this? We're trying to fetch logs from an API that requires a timeframe for which to get the logs, so we have to update this timeframe for every request. Changing the API is not an option because it is produced by another company, so the only resort is creating a custom plugin based off of this one.

@felipefuller

I have the same concern! My current approach is to insert the data through a Python script; nevertheless, this would be a better option. Any updates? Thank you very much!

@rest17

rest17 commented Dec 30, 2022

Is there any workaround for this? My use case is using a pull model to stream logs from a Loki server. We cannot use the push model supported by the loki grafana output plugin, since that requires firewall changes on the client end.

@jezkerwin

Are there any updates on this? I really need this feature.
