Switch gRPC exporters to use official gRPC retry config. Make timeout encompass retries/backoffs #4564
base: main
Conversation
...orter-otlp-proto-grpc/src/opentelemetry/exporter/otlp/proto/grpc/metric_exporter/__init__.py
Running a basic example of sending a span/metric to a nonexistent collector through gRPC:
Before (timeout not respected)
$ OTEL_EXPORTER_OTLP_TIMEOUT=5 uv run repro.py
2025-05-09 01:19:57 INFO [test] Hello world
2025-05-09 01:19:57 WARNING [opentelemetry.exporter.otlp.proto.grpc.exporter] Transient error StatusCode.UNAVAILABLE encountered while exporting metrics to localhost:4317, retrying in 1s.
2025-05-09 01:19:58 WARNING [opentelemetry.exporter.otlp.proto.grpc.exporter] Transient error StatusCode.UNAVAILABLE encountered while exporting metrics to localhost:4317, retrying in 2s.
2025-05-09 01:20:00 WARNING [opentelemetry.exporter.otlp.proto.grpc.exporter] Transient error StatusCode.UNAVAILABLE encountered while exporting metrics to localhost:4317, retrying in 4s.
2025-05-09 01:20:02 WARNING [opentelemetry.exporter.otlp.proto.grpc.exporter] Transient error StatusCode.UNAVAILABLE encountered while exporting traces to localhost:4317, retrying in 1s.
2025-05-09 01:20:03 WARNING [opentelemetry.exporter.otlp.proto.grpc.exporter] Transient error StatusCode.UNAVAILABLE encountered while exporting traces to localhost:4317, retrying in 2s.
2025-05-09 01:20:04 WARNING [opentelemetry.exporter.otlp.proto.grpc.exporter] Transient error StatusCode.UNAVAILABLE encountered while exporting metrics to localhost:4317, retrying in 8s.
2025-05-09 01:20:05 WARNING [opentelemetry.exporter.otlp.proto.grpc.exporter] Transient error StatusCode.UNAVAILABLE encountered while exporting traces to localhost:4317, retrying in 4s.
2025-05-09 01:20:09 WARNING [opentelemetry.exporter.otlp.proto.grpc.exporter] Transient error StatusCode.UNAVAILABLE encountered while exporting traces to localhost:4317, retrying in 8s.
2025-05-09 01:20:12 WARNING [opentelemetry.exporter.otlp.proto.grpc.exporter] Transient error StatusCode.UNAVAILABLE encountered while exporting metrics to localhost:4317, retrying in 16s.
Now (timeout is respected)
$ OTEL_EXPORTER_OTLP_TIMEOUT=5 uv run repro.py
2025-05-09 01:22:43 INFO [test] Hello world
2025-05-09 01:22:48 ERROR [opentelemetry.exporter.otlp.proto.grpc.exporter] Failed to export metrics to localhost:4317, error code: StatusCode.DEADLINE_EXCEEDED
2025-05-09 01:22:53 ERROR [opentelemetry.exporter.otlp.proto.grpc.exporter] Failed to export traces to localhost:4317, error code: StatusCode.DEADLINE_EXCEEDED
When exporting metrics/traces in the same program, I noticed the following. Is this expected?
$ OTEL_EXPORTER_OTLP_TRACES_TIMEOUT=5 uv run repro.py
2025-05-09 15:37:54 INFO [test] Hello world
2025-05-09 15:38:04 ERROR [opentelemetry.exporter.otlp.proto.grpc.exporter] Failed to export traces to localhost:4317, error code: StatusCode.DEADLINE_EXCEEDED
2025-05-09 15:38:04 ERROR [opentelemetry.exporter.otlp.proto.grpc.exporter] Failed to export metrics to localhost:4317, error code: StatusCode.DEADLINE_EXCEEDED
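For reference, a repro along the following lines should produce output like the above. This is my own minimal sketch, not the actual repro.py used here; it assumes the exporters' default localhost:4317 endpoint with no collector running.

```python
import logging

from opentelemetry import metrics, trace
from opentelemetry.exporter.otlp.proto.grpc.metric_exporter import OTLPMetricExporter
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.sdk.metrics import MeterProvider
from opentelemetry.sdk.metrics.export import PeriodicExportingMetricReader
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor

logging.basicConfig(
    level=logging.INFO, format="%(asctime)s %(levelname)s [%(name)s] %(message)s"
)

# Exporters default to localhost:4317; nothing is listening there, so every export fails.
tracer_provider = TracerProvider()
tracer_provider.add_span_processor(BatchSpanProcessor(OTLPSpanExporter()))
trace.set_tracer_provider(tracer_provider)

meter_provider = MeterProvider(
    metric_readers=[PeriodicExportingMetricReader(OTLPMetricExporter())]
)
metrics.set_meter_provider(meter_provider)

logging.getLogger("test").info("Hello world")
with trace.get_tracer("test").start_as_current_span("span"):
    metrics.get_meter("test").create_counter("counter").add(1)

# Shutdown flushes pending telemetry, which is where the timeout/retry behavior shows up.
tracer_provider.shutdown()
meter_provider.shutdown()
```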
...lemetry-exporter-opencensus/src/opentelemetry/exporter/opencensus/trace_exporter/__init__.py
...pentelemetry-exporter-otlp-proto-grpc/src/opentelemetry/exporter/otlp/proto/grpc/exporter.py
...pentelemetry-exporter-otlp-proto-grpc/src/opentelemetry/exporter/otlp/proto/grpc/exporter.py
...xporter-otlp-proto-http/src/opentelemetry/exporter/otlp/proto/http/_log_exporter/__init__.py
Can you add to the description which issues are fixed by this PR? https://docs.github.com/en/issues/tracking-your-work-with-issues/using-issues/linking-a-pull-request-to-an-issue
"methodConfig": [ | ||
{ | ||
"name": [dict()], | ||
"retryPolicy": { |
Does this correctly implement the OTLP spec? https://github.com/open-telemetry/opentelemetry-proto/blob/main/docs/specification.md#failures
I'm wondering about RetryInfo, and partial errors in particular.
Didn't realize RetryInfo was part of the OTLP spec. The gRPC retry policy does not make use of RetryInfo, and it isn't mentioned in https://github.com/grpc/proposal/blob/master/A6-client-retries.md or https://grpc.io/docs/guides/retry/#retry-configuration. My reading of the partial success part is that the status code is still OK in that case, which this does handle correctly (by not retrying).
So you think we should remove this then? I feel it's nice to let the gRPC library handle this stuff instead of hand-rolling our own code. If gRPC changes something around retries, I'm assuming they'd update their own retry handling code and we wouldn't have to worry about it.
The gRFC (https://github.com/grpc/proposal/blob/master/A6-client-retries.md) also mentions how the retryPolicy can be used for hedging and to respect server pushback, but neither of those appears to be implemented yet.
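For context, the retry policy being discussed is supplied to the channel as a JSON service config. A rough sketch of what that looks like (the attempt count, backoff values, and status codes below are illustrative, not necessarily what this PR configures):

```python
import json

import grpc

# An empty "name" entry applies the policy to all methods on the channel.
service_config = json.dumps(
    {
        "methodConfig": [
            {
                "name": [{}],
                "retryPolicy": {
                    "maxAttempts": 5,
                    "initialBackoff": "1s",
                    "maxBackoff": "16s",
                    "backoffMultiplier": 2,
                    "retryableStatusCodes": ["UNAVAILABLE"],
                },
            }
        ]
    }
)

channel = grpc.insecure_channel(
    "localhost:4317",
    options=[("grpc.service_config", service_config)],
)
```

With a policy like this, the per-RPC timeout acts as the overall deadline, so all retries and backoffs have to fit inside it.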
exporter/opentelemetry-exporter-otlp-proto-grpc/tests/test_otlp_exporter_mixin.py
I didn't see this in the code. If you updated it, can you just update the description as well?
Updated the description.
Description
Make timeout encompass retries and backoffs, rather than being applied per HTTP request or gRPC RPC.

Update the gRPC exporter to use the official gRPC retry config, which lets the official grpc library handle retries and timeouts automatically in line with its spec (https://github.com/grpc/proposal/blob/master/A6-client-retries.md), instead of us having to maintain custom code.

This means RetryInfo in gRPC error responses is now ignored. I discussed this with the spec committee and opened a bug (open-telemetry/opentelemetry-specification#4513) about RetryInfo being in the official spec instead of the gRPC-supported header. They were generally supportive. I will try to update the official spec, but am reaching out to the author of the spec first to get their input. The gRPC retry policy didn't exist when the spec was written, so it's likely that it wasn't recommended.

There is no equivalent for HTTP (there is urllib3.util.Retry, but its retries are not inclusive of timeout; instead each retry gets passed the same timeout), so I manually updated the HTTP exporters.

Added a +/- 20% jitter to each backoff (both gRPC and HTTP).
fixes: #3309, #4043, #2663
I also cleaned up the HTTP exporter code a bit.
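As a rough illustration of the HTTP-side behavior described above (a sketch with my own hypothetical names, not the code in this PR), the retry loop bounds a jittered exponential backoff by a single deadline derived from timeout:

```python
import random
import time


def export_with_retries(send, timeout_s, max_attempts=6):
    """Hypothetical helper: send() returns True on success, False on a retryable error."""
    deadline = time.monotonic() + timeout_s  # one deadline covering all attempts
    for attempt in range(max_attempts):
        if send():
            return True
        # Exponential backoff (1s, 2s, 4s, ...) with +/- 20% jitter.
        backoff = (2 ** attempt) * random.uniform(0.8, 1.2)
        # Give up early if sleeping would run past the overall deadline.
        if time.monotonic() + backoff >= deadline:
            return False
        time.sleep(backoff)
    return False
```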
#4183 -- similar to this PR and what's discussed in #4043, but I implemented it in as minimal a way as I could.
How Has This Been Tested?
Lots of unit tests.