-
Notifications
You must be signed in to change notification settings - Fork 698
Current handling of OTEL_EXPORTER_OTLP_TIMEOUT is not in line with spec #4044
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Comments
Good catch! This affects all OTEL_EXPORTER_OTLP*_TIMEOUT envs btw |
This seems like a difficult issue to address without greatly breaking a majority of users. As of now, there isn't really an elegant way of fixing this besides from a major version bump of the otlp exporters. For maintainers, this implies also a change in release process since the otlp exporters won't be released with the same versions anymore. Open to other suggestions as well. |
Not sure, I can not really judge that. Before thinking about how to communicate a change, I'd try to clarify the actual technical consequences for users.
Please also note: Other than in other language client implementations the timeout has no influence on the total time a export may take (see #4043). Currently the timeout is applied to each export call, the exponential backoff retry is coming on top of that. This is another related aspect which totally unaligned with the spec and other language clients I had a look at. |
As discussed in the SIG a plan could be to add a python specific environment variable to delegate the choice of the adhering to the spec (ms) or keep the current behaviour (s) to the distributions. This way we buy some time upstream but downstream can act fast. And advanced users can do the conversion in autonomy. What do you think? |
Other than handling the conversion from the env var we need to handle the fact that the exporters takes the timeout in seconds as int so in order to have a fine-grain value we need to move to seconds as float or keep int but get milliseconds. |
Discussed this issue in the SIG meeting today. Here is what I propose. We are currently in 1.25.0. We release 1.26.0 normally (this should be happening in the next few days). After we release 1.26.0, we add a warning that gets triggered when someone uses the
Then we release 1.27.0 which could not cause an issue and will show users this first warning above. After releaseing 1.27.0, we change the warning above so that it will mention:
Then we release 1.28.0 which could cause an issue and will show users this second warning above. After releasing 1.28.0 we remove the warning. Then we release 1.29.0 which will show users no warning related to this issue. |
Something else, right now we use an integer to store the timeout value. It may be useful for someone if we stored that value as a float so that it can define a timeout smaller than a second (now) or a millisecond (after we fix this issue). We have decided to keep using an integer for the time being. Afte we fix this issue we may discuss again if we want to use a float instead. |
@LarsMichelsen thank you for the summary in #4044 (comment), I agree with your conclusions. In particular:
I'm concerned that will be too disruptive for us to make the change proposed by @ocelotl and fix this in a minor version. Users might ignore or otherwise not see the warnings. Some other thoughts
|
And their application will break. That will teach them a lesson. 😎 |
What I meant yesterday it that to reduce the breakage from going seconds to ms in the Exporter classes we may keep seconds in the |
I'm fine with adding a warning but to ease the conversion for distros I would add a configuration option to choose the timeout unit, explicitly e.g. Bumping the major and avoid adding more code to handle this is fine for me too. |
I also stumbled upon this while digging through the shutdown timeout issues. There are several inconsistencies which should not exist without good reasons:
I agree that this need to be thought through carefully and maybe we can come up with better approaches thinking this through further. However, I think it's important for a project to be able to do changes of such kind in some way. Otherwise historic issues will pile up over time making the project unusable at some point.
This would enable users to make the change of their configuration in a minor release and prepare for the major bump which would then change the behavior. I understand this as the main point of the proposal, right? Without an additional warning about the future incompatibility, I doubt many users would use this option. In the end, most people read change logs retrospectively in case they experience any issue. Also, even if they notice this, changing the deployments to introduce this temporary environment variable seems to be work that people would simply not do. I think most users would just do the major version bump and in the best case they read about the change in advance and would change their configuration together with the update. In many cases they won't do this in the same step, because they did not read the changelog, then run into problems, look at the changelog and then change their configuration. |
Not really, this would let upstream to keep the same behavior for a bit more time but users (including downstream distros) willing to switch will have the ability to do it.
I'm proposing a way to configure the behavior, I'm not proposing to change the behavior at the same time. |
Ah, thanks for the clarification. I guess I am just missing the broader picture to get that. Is there a place where I can read more about the greater picture, like the downstream distros you mentioned? |
We are trying to define the broader picture as there are different opinions :) Me and others participating in this project are working for companies building their own distributions (see https://opentelemetry.io/docs/concepts/distributions/) |
Yes I agree. But in the case of a minor version release, I think the user will rightfully come here complaining that we broke things. For a major version, it should be more obvious and it also signals which package might have created the breaking change. Question for maintainers: is there any downside to a major bump beyond that we need to potentially backport security fixes for some finite window? |
Follow up to the SIG discussion: @xrmx's concern is fixing this in a timely manner. @open-telemetry/python-maintainers is there any objection or downside to doing a major version bump? The only thing I can think of is that it will take us out of locksetup. |
Yes the only downside is the maintenance overhead. I think this is a good time to come with an automated process of handling packages that are not released in lockstep (especially now that sdk-extension-aws is also being added to the release cycle: open-telemetry/opentelemetry-python-contrib#1899). Currently, there is not a clean way to do this and may require a bit of complicated version management in our scripts. |
This may be the source of the confusion where the timeout are expressed in seconds: |
We discussed this at SIG 12/12, and we'll investigate two possibilities:
|
Here is an example of a behavior breaking change that we decided we would do. We viewed as being "not spec-compliant" and viewed it as a bug fix, A Here is an example of a non-spec compliant bug that is of similar nature to this one. Since there is precedence already, perhaps we can treat this as a bugfix with a warning. The longer we hold off, the more potential users this will affect. |
I agree with the actions from the last SIG meeting, and I'll help to try to find precedence and opinions from other SIGs. IMO we should update the compliance matrix first to indicate that Python is currently not compliant. This can help communicate the current state. It will probably take a while (in fact it already has) to become fully compliant, so that change can at least clear any doubts during the process. This will make all of Another thing I think we should do as soon as possible would be to propose a change to the spec here to have the default in the unit supported by those env vars mentioned below (either that or mentioning in the env vars in the next line). This would avoid future confusion, not only for contributors but also for end-users. Related to this change. Regardless of what approach is taken, it will require as much communication as technically possible. I do like the approach mentioned above regarding warning messages on startup (although perhaps leaving messages there for longer before and after the change). However, it doesn't solve the problem that, at some point, it will be a behaviour change and someone will be caught by it off guard. On the other hand, the longer we leave this not compliant, the larger the problem becomes as more people are using this env var. Of course, a major version is clearly a way to achieve wide communication of the change, but considering this is something not compliant, I'd personally prefer to avoid it, if at all possible. While we gather more input, there's another aspect that I don't know if it was discussed, and may be worth considering. In an ideal world, we'd mark something as "deprecated" or "marked for removal" while providing a new, better way of doing the same thing that solves the problem. Could this be a possibility for the work to adopt declarative config (i.e. file config) to help? My thinking (and I'm not proposing this just thinking out loud) is the following:
This may create more confusion than necessary, but thought it is worth bringing out the topic of file config, as it will come up regardless (i.e. if file config is implemented the issue with different units will appear regardless). Also, file config will be the recommended way of doing config as of OTEP 225 so something that may see adoption in any case. |
Yes, please don't break users on this one. Please mark the current env var as deprecated, move forwards with implementing config files, and add an additional |
Adding Almost as if the recommended "template" to move from env to config file for Python could just use |
Describe your environment
No response
What happened?
In
OTLPExporterMixin.__init__
the environment variableOTEL_EXPORTER_OTLP_TIMEOUT
looks like is handled as seconds while the SDK says it has to be in milliseconds (https://opentelemetry.io/docs/languages/sdk-configuration/otlp-exporter/#otel_exporter_otlp_timeout) and the compliance matrix says the Python implementation is compliant.The description in
opentelemetry-sdk/src/opentelemetry/sdk/environment_variables.py
does not mention the exact unit.Since the timeout is currently handed over as timeout to
grpc.experimental.unary_unary
, which is documented like this:To my understanding this only an effect in case there is a connection made, but no answer on the RPC sent back, which might not happen very often.
Steps to Reproduce
Start a
server.py
which accepts the client but does not respond:Use this test
client.py
:Preparation: Start the server
Test 1: Start the client without environment variable. It uses the default timeout of 10 seconds per try.
Test 2: Start the client with the environment variable. The time waiting for the export call is now 5 seconds instead of 10 seconds. But according to the spec it should be 5ms with this value.
Expected Result
The Python implementation is in line with the spec and the documentation is clear about the unit.
Actual Result
Setting
OTEL_EXPORTER_OTLP_TIMEOUT=1000
sets a timeout of 1000 seconds instead of 1 second.Additional context
No response
Would you like to implement a fix?
None
The text was updated successfully, but these errors were encountered: