Current handling of OTEL_EXPORTER_OTLP_TIMEOUT is not in line with spec #4044

Open
LarsMichelsen opened this issue Jul 11, 2024 · 24 comments
Labels
bug Something isn't working

@LarsMichelsen

Describe your environment

No response

What happened?

In OTLPExporterMixin.__init__ the environment variable OTEL_EXPORTER_OTLP_TIMEOUT appears to be handled as seconds, while the SDK documentation says it has to be in milliseconds (https://opentelemetry.io/docs/languages/sdk-configuration/otlp-exporter/#otel_exporter_otlp_timeout) and the compliance matrix says the Python implementation is compliant.

The description in opentelemetry-sdk/src/opentelemetry/sdk/environment_variables.py does not mention the exact unit.

The timeout is currently handed over as the timeout argument to grpc.experimental.unary_unary, which is documented like this:

      timeout: An optional duration of time in seconds to allow for the RPC,
        after which an exception will be raised. If timeout is unspecified,
        defaults to a timeout controlled by the
        GRPC_PYTHON_DEFAULT_TIMEOUT_SECONDS environment variable. If that is
        unset, defaults to 60 seconds. Supply a value of None to indicate that
        no timeout should be enforced.

To my understanding, this timeout only has an effect when a connection is made but no answer to the RPC is sent back, which might not happen very often.
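
For illustration only (this is not the actual exporter code), a spec-compliant reading of the variable would have to convert the milliseconds from the environment into the seconds that grpc expects, roughly like this:

import os

# Per the spec, OTEL_EXPORTER_OTLP_TIMEOUT is in milliseconds, while grpc
# expects its timeout argument in seconds, so a compliant implementation
# would have to divide by 1000 before handing the value to grpc.
raw = os.environ.get("OTEL_EXPORTER_OTLP_TIMEOUT")
timeout_seconds = int(raw) / 1000 if raw is not None else 10.0  # spec default: 10 s

# The current implementation instead passes the value through unchanged,
# so OTEL_EXPORTER_OTLP_TIMEOUT=1000 becomes a 1000-second gRPC deadline.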

Steps to Reproduce

Start a server.py which accepts the client but does not respond:

#!/usr/bin/env python3

import time
from concurrent import futures

import grpc
from opentelemetry.proto.collector.trace.v1.trace_service_pb2_grpc import add_TraceServiceServicer_to_server, TraceServiceServicer

class NoResponseTraceService(TraceServiceServicer):
    def Export(self, request, context):
        while True:
            time.sleep(1)  # Sleep forever

server = grpc.server(futures.ThreadPoolExecutor(max_workers=10))
add_TraceServiceServicer_to_server(NoResponseTraceService(), server)

server.add_insecure_port('[::]:55680')
server.start()

try:
    while True:
        time.sleep(86400)
except KeyboardInterrupt:
    server.stop(0)

Use this test client.py:

#!/usr/bin/env python3

from opentelemetry import trace
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor

trace.set_tracer_provider(TracerProvider())

trace.get_tracer_provider().add_span_processor(
    BatchSpanProcessor(OTLPSpanExporter(endpoint="localhost:55680", insecure=True))
)
tracer = trace.get_tracer(__name__)
with tracer.start_as_current_span("Hello") as span:
    span.set_attribute("language", "Python")

Preparation: Start the server

Test 1: Start the client without the environment variable. It uses the default timeout of 10 seconds per try.

> python3 client.py 2>&1 | ts
Jul 11 08:55:59 Transient error StatusCode.DEADLINE_EXCEEDED encountered while exporting traces to localhost:55680, retrying in 1s.
Jul 11 08:56:10 Transient error StatusCode.DEADLINE_EXCEEDED encountered while exporting traces to localhost:55680, retrying in 2s.
Jul 11 08:56:22 Transient error StatusCode.DEADLINE_EXCEEDED encountered while exporting traces to localhost:55680, retrying in 4s.
Jul 11 08:56:36 Transient error StatusCode.DEADLINE_EXCEEDED encountered while exporting traces to localhost:55680, retrying in 8s.
Jul 11 08:56:54 Transient error StatusCode.DEADLINE_EXCEEDED encountered while exporting traces to localhost:55680, retrying in 16s.
Jul 11 08:57:20 Transient error StatusCode.DEADLINE_EXCEEDED encountered while exporting traces to localhost:55680, retrying in 32s.

Test 2: Start the client with the environment variable set. The waiting time for the export call is now 5 seconds instead of 10 seconds. But according to the spec it should be 5 ms with this value.

> OTEL_EXPORTER_OTLP_TIMEOUT=5 python3 client.py 2>&1 | ts
Jul 11 08:58:56 Transient error StatusCode.DEADLINE_EXCEEDED encountered while exporting traces to localhost:55680, retrying in 1s.
Jul 11 08:59:02 Transient error StatusCode.DEADLINE_EXCEEDED encountered while exporting traces to localhost:55680, retrying in 2s.
Jul 11 08:59:09 Transient error StatusCode.DEADLINE_EXCEEDED encountered while exporting traces to localhost:55680, retrying in 4s.
Jul 11 08:59:18 Transient error StatusCode.DEADLINE_EXCEEDED encountered while exporting traces to localhost:55680, retrying in 8s.
Jul 11 08:59:31 Transient error StatusCode.DEADLINE_EXCEEDED encountered while exporting traces to localhost:55680, retrying in 16s.
Jul 11 08:59:52 Transient error StatusCode.DEADLINE_EXCEEDED encountered while exporting traces to localhost:55680, retrying in 32s.

Expected Result

The Python implementation is in line with the spec and the documentation is clear about the unit.

Actual Result

Setting OTEL_EXPORTER_OTLP_TIMEOUT=1000 sets a timeout of 1000 seconds instead of 1 second.

Additional context

No response

Would you like to implement a fix?

None

@LarsMichelsen LarsMichelsen added the bug Something isn't working label Jul 11, 2024
@xrmx
Contributor

xrmx commented Jul 11, 2024

Good catch! This affects all OTEL_EXPORTER_OTLP*_TIMEOUT envs btw

@lzchen
Contributor

lzchen commented Jul 11, 2024

This seems like a difficult issue to address without greatly breaking a majority of users. As of now, there isn't really an elegant way of fixing this aside from a major version bump of the otlp exporters. For maintainers, this also implies a change in the release process, since the otlp exporters won't be released with the same versions anymore. Open to other suggestions as well.

@LarsMichelsen
Author

LarsMichelsen commented Jul 11, 2024

This seems like a difficult issue to address without greatly breaking a majority of users.

Not sure, I cannot really judge that. Before thinking about how to communicate a change, I'd try to clarify the actual technical consequences for users.

  1. Users who configured the timeout in seconds would, once the value is interpreted as milliseconds, provide a very small timeout to the gRPC channel. This could lead to timeouts in the communication (I have not checked how it is implemented in gRPC), which would then lead to retries that might succeed or fail as well. In the end, data may be dropped.
  2. Users who configured millisecond values according to the spec documentation currently risk far too high timeouts, leading to unexpectedly long-hanging processes, which may also result in dropped data. I guess this is the rarer situation, but it may still happen.

Please also note: unlike other language client implementations, the timeout has no influence on the total time an export may take (see #4043). Currently the timeout is applied to each export attempt, and the exponential backoff retry comes on top of that. This is another related aspect that is completely unaligned with the spec and the other language clients I looked at.

@xrmx
Contributor

xrmx commented Jul 11, 2024

As discussed in the SIG, a plan could be to add a Python-specific environment variable that delegates the choice between adhering to the spec (ms) and keeping the current behaviour (s) to the distributions. This way we buy some time upstream, but downstream can act fast, and advanced users can do the conversion on their own. What do you think?

@xrmx
Contributor

xrmx commented Jul 17, 2024

Besides handling the conversion from the env var, we need to handle the fact that the exporters take the timeout in seconds as an int, so in order to have a fine-grained value we either need to move to seconds as a float or keep an int but take milliseconds.
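
To illustrate the two options (a sketch only, not a proposed API), keeping the parameter in seconds requires a float to represent sub-second values, while keeping an int means switching its unit to milliseconds:

import os

# Assumes the env var is read in milliseconds as the spec requires;
# 10000 ms is the spec's default timeout.
raw_ms = int(os.environ.get("OTEL_EXPORTER_OTLP_TIMEOUT", "10000"))

# Option 1: keep the exporter parameter in seconds, but make it a float,
# so e.g. 500 ms can be expressed as 0.5 s.
timeout_seconds: float = raw_ms / 1000.0

# Option 2: keep an int, but change the parameter's unit to milliseconds.
timeout_millis: int = raw_ms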

@ocelotl
Contributor

ocelotl commented Jul 18, 2024

Discussed this issue in the SIG meeting today.

Here is what I propose.

We are currently in 1.25.0. We release 1.26.0 normally (this should be happening in the next few days).

After we release 1.26.0, we add a warning that gets triggered when someone uses the OTEL_EXPORTER_OTLP_TIMEOUT environment variable. This warning will mention:

  1. That in release 1.27.0 the value is interpreted as being in seconds, even though it should be interpreted as being in milliseconds.
  2. That in release 1.28.0 we will change the interpretation of the environment variable from seconds to milliseconds, which can cause issues for our end users.

Then we release 1.27.0, which cannot cause an issue and will show users this first warning above.

After releasing 1.27.0, we change the warning above so that it will mention:

  1. That in release 1.28.0 we changed the interpretation of the environment variable from seconds to milliseconds, which can cause issues for our end users.
  2. That users should check that the value of the environment variable is in milliseconds if an issue occurs.

Then we release 1.28.0, which could cause issues and will show users this second warning above.

After releasing 1.28.0, we remove the warning.

Then we release 1.29.0, which will show users no warning related to this issue.
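
As an illustration only (neither the helper name nor the wording comes from the actual code), such a transitional warning could be emitted wherever the environment variable is read:

import os
import warnings

# Hypothetical helper sketching the proposal above; the real exporter reads
# the variable elsewhere and may phrase the warning differently.
def _read_otlp_timeout() -> int:
    raw = os.environ.get("OTEL_EXPORTER_OTLP_TIMEOUT")
    if raw is None:
        return 10  # current default, in seconds
    warnings.warn(
        "OTEL_EXPORTER_OTLP_TIMEOUT is currently interpreted in seconds; "
        "starting with release 1.28.0 it will be interpreted in milliseconds, "
        "as required by the specification.",
        stacklevel=2,
    )
    return int(raw)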

@ocelotl
Contributor

ocelotl commented Jul 18, 2024

Something else: right now we use an integer to store the timeout value. It may be useful for someone if we stored that value as a float, so that one can define a timeout smaller than a second (now) or a millisecond (after we fix this issue).

We have decided to keep using an integer for the time being. After we fix this issue we may discuss again whether we want to use a float instead.

@aabmass
Member

aabmass commented Jul 18, 2024

@LarsMichelsen thank you for the summary in #4044 (comment), I agree with your conclusions. In particular:

  • Users who configured the timeout in seconds would, once the value is interpreted as milliseconds, provide a very small timeout to the gRPC channel. This could lead to timeouts in the communication (I have not checked how it is implemented in gRPC), which would then lead to retries that might succeed or fail as well. In the end, data may be dropped.

I'm concerned that it will be too disruptive for us to make the change proposed by @ocelotl and fix this in a minor version. Users might ignore or otherwise not see the warnings. Some other thoughts:

  • Since this is only affecting the OTLP exporter and not the "core" API/SDK, wdyt of making a new major version release?
  • What are the consequences of NOT fixing the issue and keeping things as they are?

@ocelotl
Contributor

ocelotl commented Jul 18, 2024

Users might ignore or otherwise not see the warnings.

And their application will break. That will teach them a lesson. 😎

@xrmx
Contributor

xrmx commented Jul 19, 2024

Something else: right now we use an integer to store the timeout value. It may be useful for someone if we stored that value as a float, so that one can define a timeout smaller than a second (now) or a millisecond (after we fix this issue).

We have decided to keep using an integer for the time being. After we fix this issue we may discuss again whether we want to use a float instead.

What I meant yesterday is that, to reduce the breakage from going from seconds to ms in the Exporter classes, we may keep seconds in the timeout parameter and switch its type to float, so we are still able to express the values that can be set in milliseconds via the environment variables. If we change the semantics of the exporter timeout parameter to milliseconds, it does not make sense in my opinion to move it to float.

@xrmx
Contributor

xrmx commented Jul 19, 2024

@LarsMichelsen thank you for the summary in #4044 (comment), I agree with your conclusions. In particular:

  • Users who configured the timeout in seconds would, once the value is interpreted as milliseconds, provide a very small timeout to the gRPC channel. This could lead to timeouts in the communication (I have not checked how it is implemented in gRPC), which would then lead to retries that might succeed or fail as well. In the end, data may be dropped.

I'm concerned that it will be too disruptive for us to make the change proposed by @ocelotl and fix this in a minor version. Users might ignore or otherwise not see the warnings. Some other thoughts:

* Since this is only affecting the OTLP exporter and not the "core" API/SDK, wdyt of making a new major version release?

* What are the consequences of NOT fixing the issue and keeping things as they are?

I'm fine with adding a warning but to ease the conversion for distros I would add a configuration option to choose the timeout unit, explicitly e.g. OTEL_PYTHON_EXPORTER_UNIT_TEMPORARY_COMPAT=[s|ms] or implicitly e.g. OTEL_PYTHON_EXPORTER_SECONDS_TEMPORARY_COMPAT=[true|false].

Bumping the major version and avoiding adding more code to handle this is fine for me too.
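
A rough sketch of how such a temporary opt-in could be interpreted, using the explicit variant proposed above (the variable name and the seconds default are only illustrative; nothing here is an existing setting):

import os

# OTEL_PYTHON_EXPORTER_UNIT_TEMPORARY_COMPAT is the switch proposed in this
# thread, not an existing setting; it defaults to today's behaviour (seconds).
unit = os.environ.get("OTEL_PYTHON_EXPORTER_UNIT_TEMPORARY_COMPAT", "s")
raw = os.environ.get("OTEL_EXPORTER_OTLP_TIMEOUT")

if raw is None:
    timeout_seconds = 10.0  # default stays at 10 seconds either way
elif unit == "ms":
    timeout_seconds = int(raw) / 1000.0  # spec-compliant interpretation
else:
    timeout_seconds = float(raw)  # current, seconds-based interpretation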

@LarsMichelsen
Author

Something else: right now we use an integer to store the timeout value. It may be useful for someone if we stored that value as a float, so that one can define a timeout smaller than a second (now) or a millisecond (after we fix this issue).

We have decided to keep using an integer for the time being. After we fix this issue we may discuss again whether we want to use a float instead.

I also stumbled upon this while digging through the shutdown timeout issues. There are several inconsistencies which should not exist without good reasons:

  1. milliseconds vs. seconds
  2. timeout vs. timeout_millis naming, sometimes with the unit, sometimes without
  3. float vs. int

I'm concerned that it will be too disruptive for us to make the change proposed by @ocelotl and fix this in a minor version. Users might ignore or otherwise not see the warnings. Some other thoughts:

  • Since this is only affecting the OTLP exporter and not the "core" API/SDK, wdyt of making a new major version release?
  • What are the consequences of NOT fixing the issue and keeping things as they are?

I agree that this needs to be thought through carefully, and maybe we can come up with better approaches by thinking it through further. However, I think it's important for a project to be able to make changes of this kind in some way. Otherwise historic issues will pile up over time, making the project unusable at some point.

I'm fine with adding a warning but to ease the conversion for distros I would add a configuration option to choose the timeout unit, explicitly e.g. OTEL_PYTHON_EXPORTER_UNIT_TEMPORARY_COMPAT=[s|ms] or implicitly e.g. OTEL_PYTHON_EXPORTER_SECONDS_TEMPORARY_COMPAT=[true|false].

This would enable users to make the change of their configuration in a minor release and prepare for the major bump which would then change the behavior. I understand this as the main point of the proposal, right? Without an additional warning about the future incompatibility, I doubt many users would use this option. In the end, most people read change logs retrospectively in case they experience any issue.

Also, even if they notice this, changing the deployments to introduce this temporary environment variable seems to be work that people would simply not do.

I think most users would just do the major version bump and in the best case they read about the change in advance and would change their configuration together with the update. In many cases they won't do this in the same step, because they did not read the changelog, then run into problems, look at the changelog and then change their configuration.

@xrmx
Contributor

xrmx commented Jul 19, 2024

I'm fine with adding a warning but to ease the conversion for distros I would add a configuration option to choose the timeout unit, explicitly e.g. OTEL_PYTHON_EXPORTER_UNIT_TEMPORARY_COMPAT=[s|ms] or implicitly e.g. OTEL_PYTHON_EXPORTER_SECONDS_TEMPORARY_COMPAT=[true|false].

This would enable users to make the change of their configuration in a minor release and prepare for the major bump which would then change the behavior. I understand this as the main point of the proposal, right? Without an additional warning about the future incompatibility, I doubt many users would use this option. In the end, most people read change logs retrospectively in case they experience any issue.

Also, even if they notice this, changing the deployments to introduce this temporary environment variable seems to be work that people would simply not do.

Not really, this would let upstream keep the same behavior for a bit longer, but users (including downstream distros) willing to switch would have the ability to do so.

I think most users would just do the major version bump and in the best case they read about the change in advance and would change their configuration together with the update. In many cases they won't do this in the same step, because they did not read the changelog, then run into problems, look at the changelog and then change their configuration.

I'm proposing a way to configure the behavior, I'm not proposing to change the behavior at the same time.

@LarsMichelsen
Author

Not really, this would let upstream keep the same behavior for a bit longer, but users (including downstream distros) willing to switch would have the ability to do so.

Ah, thanks for the clarification. I guess I am just missing the broader picture to get that. Is there a place where I can read more about the greater picture, like the downstream distros you mentioned?

@xrmx
Contributor

xrmx commented Jul 19, 2024

Not really, this would let upstream keep the same behavior for a bit longer, but users (including downstream distros) willing to switch would have the ability to do so.

Ah, thanks for the clarification. I guess I am just missing the broader picture to get that. Is there a place where I can read more about the greater picture, like the downstream distros you mentioned?

We are trying to define the broader picture as there are different opinions :)

Others participating in this project and I work for companies building their own distributions (see https://opentelemetry.io/docs/concepts/distributions/)

@aabmass
Member

aabmass commented Jul 19, 2024

Also, even if they notice this, changing the deployments to introduce this temporary environment variable seems to be work that people would simply not do.

I think most users would just do the major version bump and in the best case they read about the change in advance and would change their configuration together with the update. In many cases they won't do this in the same step, because they did not read the changelog, then run into problems, look at the changelog and then change their configuration.

Yes I agree. But in the case of a minor version release, I think the user will rightfully come here complaining that we broke things. For a major version, it should be more obvious and it also signals which package might have created the breaking change.

Question for maintainers: is there any downside to a major bump beyond that we need to potentially backport security fixes for some finite window?

@aabmass
Member

aabmass commented Jul 25, 2024

Follow up to the SIG discussion:

@xrmx's concern is fixing this in a timely manner.

@open-telemetry/python-maintainers is there any objection or downside to doing a major version bump? The only thing I can think of is that it will take us out of lockstep.

@lzchen
Contributor

lzchen commented Jul 25, 2024

is there any objection or downside to doing a major version bump? The only thing I can think of is that it will take us out of lockstep.

Yes, the only downside is the maintenance overhead. I think this is a good time to come up with an automated process for handling packages that are not released in lockstep (especially now that sdk-extension-aws is also being added to the release cycle: open-telemetry/opentelemetry-python-contrib#1899). Currently there is not a clean way to do this, and it may require a bit of complicated version management in our scripts.

@xrmx
Contributor

xrmx commented Jul 29, 2024

This may be the source of the confusion where the timeouts are expressed in seconds:
https://opentelemetry.io/docs/specs/otel/protocol/exporter/

@emdneto
Member

emdneto commented Dec 12, 2024

We discussed this at SIG 12/12, and we'll investigate two possibilities:

  1. Find precedent in what was done in the past for bug fixes in stable packages, and not categorize this as a breaking change
  2. Major version bump -- needs some development of workflows

@lzchen
Contributor

lzchen commented Dec 12, 2024

Here is an example of a behavior-breaking change that we decided we would do: we viewed it as being "not spec-compliant" and treated it as a bug fix.
Here is another example of this.

Here is an example of a non-spec compliant bug that is of similar nature to this one.

Since there is precedent already, perhaps we can treat this as a bugfix with a warning. The longer we hold off, the more potential users this will affect.

@danielgblanco

I agree with the actions from the last SIG meeting, and I'll help to try to find precedent and opinions from other SIGs.

IMO we should update the compliance matrix first to indicate that Python is currently not compliant. This can help communicate the current state. It will probably take a while (in fact, it already has) to become fully compliant, so that change can at least clear any doubts during the process. This will make all of OTEL_EXPORTER_OTLP_* seem non-compliant, but it is worth stating.

Another thing I think we should do as soon as possible is to propose a change to the spec so that the default is given in the unit supported by the env vars mentioned below it (either that, or mentioning the unit with the env vars in the next line). This would avoid future confusion, not only for contributors but also for end users.

Related to this change: regardless of what approach is taken, it will require as much communication as technically possible. I do like the approach mentioned above regarding warning messages on startup (although perhaps leaving the messages there for longer before and after the change). However, it doesn't solve the problem that, at some point, this will be a behaviour change and someone will be caught off guard by it. On the other hand, the longer we leave this non-compliant, the larger the problem becomes as more people use this env var. Of course, a major version is clearly a way to achieve wide communication of the change, but considering this is something non-compliant, I'd personally prefer to avoid it, if at all possible.

While we gather more input, there's another aspect that I don't know whether it has been discussed and that may be worth considering. In an ideal world, we'd mark something as "deprecated" or "marked for removal" while providing a new, better way of doing the same thing that solves the problem. Could the work to adopt declarative config (i.e. file config) be a possibility to help here? My thinking (and I'm not proposing this, just thinking out loud) is the following:

  1. Mark Python as non-compliant in the matrix.
  2. Work on supporting file config, make it compliant, and wait for stability in that area. Yes, env var substitution would be a bit weird because the same concept would be in two different units, but at least when users adopt file config they can fix it at that point, as long as the instructions are clear.
  3. Add the warning messages mentioned above, also providing a way to do this in a stable manner (i.e. when file config is stable), and set a future version for compliance.

This may create more confusion than necessary, but I thought it was worth bringing up the topic of file config, as it will come up regardless (i.e. if file config is implemented, the issue with different units will appear anyway). Also, file config will be the recommended way of doing config as of OTEP 225, so it is something that may see adoption in any case.

@tedsuo

tedsuo commented Dec 19, 2024

Yes, please don't break users on this one. Please mark the current env var as deprecated, move forwards with implementing config files, and add an additional OTEL_EXPORTER_OTLP_TIMEOUT_MILLI or something similar if having a millisecond-based env var is important. Staying backwards compatible is more important than staying compatible with the spec.

@danielgblanco

danielgblanco commented Dec 20, 2024

Adding OTEL_EXPORTER_OTLP_TIMEOUT_MILLIS may not be that useful by itself for now, as one can currently do the same in seconds. However, when file config is supported it'll give an easier way to move back and forth between file and env for teams that are supporting packaged configs (or opinionated thin abstraction layers that deal with config setup).

It's almost as if the recommended "template" to move from env to config file for Python could just use OTEL_EXPORTER_OTLP_TIMEOUT_MILLIS instead of OTEL_EXPORTER_OTLP_TIMEOUT in variable substitution.
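
A minimal sketch of how a millisecond-based variable could take precedence over the existing one (OTEL_EXPORTER_OTLP_TIMEOUT_MILLIS is only the name suggested above, and the fallback behaviour shown here is an assumption, not something agreed in this thread):

import os

# Hypothetical fallback chain: prefer the new millisecond-based variable,
# fall back to the existing seconds-based one, then to the 10 s default.
millis = os.environ.get("OTEL_EXPORTER_OTLP_TIMEOUT_MILLIS")
seconds = os.environ.get("OTEL_EXPORTER_OTLP_TIMEOUT")

if millis is not None:
    timeout_seconds = int(millis) / 1000.0
elif seconds is not None:
    timeout_seconds = float(seconds)
else:
    timeout_seconds = 10.0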
