From a6d6fadfaf3e817ccb2721524d8f0166ad44939d Mon Sep 17 00:00:00 2001 From: Scott Gerring Date: Fri, 14 Feb 2025 10:57:36 +0100 Subject: [PATCH 1/8] docs: Add ADR dir and error handling ADR --- docs/adr/001_error_handling.md | 97 ++++++++++++++++++++++++++++++++++ docs/adr/README.md | 5 ++ 2 files changed, 102 insertions(+) create mode 100644 docs/adr/001_error_handling.md create mode 100644 docs/adr/README.md diff --git a/docs/adr/001_error_handling.md b/docs/adr/001_error_handling.md new file mode 100644 index 0000000000..a692450a77 --- /dev/null +++ b/docs/adr/001_error_handling.md @@ -0,0 +1,97 @@ +# Error handling patterns in public API interfaces + + +## Context and Problem Statement +There is uncertainty around how to model errors in in the `opentelemetry-rust` public API interfaces - that is, APIs facing the consumers. At the time of writing this is an important issue to resolve as moving beginning to move the signals towards RC and eventually a stable release is an urgent priority. + +The situation is as follows; a concrete example is given, but the issue holds across various public traits, in particular the exporters: + +* A given public interface in `opentelemetry-sdk`,such as [trait LogExporter](https://github.com/open-telemetry/opentelemetry-rust/blob/3ec4c186ad22944b208ae7c3d38435e735a85c8e/opentelemetry-sdk/src/logs/export.rs#L115) +* ... exposes multiple discrete actions with logically disjoint error types (e.g. [export](https://github.com/open-telemetry/opentelemetry-rust/blob/3ec4c186ad22944b208ae7c3d38435e735a85c8e/opentelemetry-sdk/src/logs/export.rs#L133-L136) and [shutdown](https://github.com/open-telemetry/opentelemetry-rust/blob/3ec4c186ad22944b208ae7c3d38435e735a85c8e/opentelemetry-sdk/src/logs/export.rs#L139) - that is, the class of errors returned for each of these actions are foreseeably very different, as is the callers reaction to them +* ... is implemented by multiple concrete types such as `InMemoryLogExporter`, `OtlpLogExporter`, `StdOutLogExporter` that have different error requirements - for instance, an `OtlpLogExporter` will experience network failures, an `InMemoryLogExporter` will not +* Potentially has operations on the API that, either in the direct implementation, or in a derived utility that utilises the direct implementation, call _multiple_ API actions and therefore need to return an aggregated log type + +Today, we have a situation where a single error type is used per API-trait, and some methods simply swallow their errors. In the example above of `LogExporter`, `shutdown` swallows errors, and `export` returns the `LogError` type, a type that could conceptually be thought of as belonging to the entire trait, not a particular method. For the exporters, the [opentelemetry-specification](https://github.com/open-telemetry/opentelemetry-specification/blob/main/specification/logs/sdk.md#export) tells us that they need to indicate success or failure, with a distinction made between 'failed' and 'timed out'. + +There are also similar examples in the builders for providers and exports. + +## Related Work + +* #2564 +* #2561 +* #2381 + +## Considered Options + +**Option 1: Continue as is** +Continue the status quo, returning a mix of either nothing or the trait-wide error type. This is inconsistent and limits the caller's ability to handle errors. + +**Option 2: Extend trait-wide error type to all methods on trait** +In this option we keep the existing error type, add it to the remaining methods on the trait, and extend the error type to include errors covering the new error conditions. This will mean that callers will have to know how and when to discard errors from a particular API call based on an understanding of which subset of errors that particular call can make. + +Conversely, it will reduce the number of error types in the code base. + +**Option 3: Introduce an error-type per fallible operation, aggregate these into a single trait-wide error type** + +For example, in the above we'd have something like: +```rust +pub trait LogExporter { + + fn export(...) -> Result<..., ExportError>; + fn shutdown(...) -> Result<..., ShutdownError> +} + +// Concrete errors for an export operation +pub enum ExportError { + // The distinction between failure and timed out is part of the OTEL spec + // we need to meet. + + ExportFailed, + + ExportTimedOut(Duration), + + // Allow impls to box up errors that can't be logically mapped + // back to one of the APIs errors + #[error("Unknown error (should not occur): {source:?}")] + Unknown { + source: Box, + }, +} + +// Aggregate error type for convenience +// **Note**: This will be added in response to need, not pre-emptively +#[derive(Debug, thiserror::Error)] +pub enum LogError { + #[error("Export error: {0}")] + InitError(#[from] ExportError), + + #[error("Shutdown error: {0}")] + ShutdownError(#[from] ShutdownError), +} + +// A downcast helper for callers that need to work with impl-specific +// unknown errors concretely +impl ExportError { + /// Attempt to downcast the inner `source` error to a specific type `T` + pub fn downcast_ref(&self) -> Option<&T> { + if let ExportError::Unknown { source } = self { + source.downcast_ref::() + } else { + None + } + } +} +``` + +## Decision Outcome + +Chosen option: **"Option 3: Introduce an error-type per fallible operation, aggregate these into a single trait-wide error type"** + +### Consequences + +* Good, because callers can handle focussed errors with focussed remediation +* Good, because implementors of the `pub trait`s can box up custom errors in a fashion that follow's [canonical's error and panic discipline](https://canonical.github.io/rust-best-practices/error-and-panic-discipline.html) guide, by avoiding type erasure of impl-specific errors +* Good, because the per-trait error type (`LogError` for `LogExporter` above) provides consumers of the trait that hit multiple methods in a single method an error type they can use +* Bad, because there's more code than a single error type +* Bad, because a caller may need to use `downcast_ref` if they have a known trait impl and want to handle a `Unknown` error + diff --git a/docs/adr/README.md b/docs/adr/README.md new file mode 100644 index 0000000000..826cfae114 --- /dev/null +++ b/docs/adr/README.md @@ -0,0 +1,5 @@ +# Architectural Decision Records + +This directory contains architectural decision records made for the opentelemetry-rust project. These allow us to consolidate discussion, options, and outcomes, around key architectural decisions. + +* [001 - Error Handling](001_error_handling.md) \ No newline at end of file From 7faa9a62aa6ec1a73ac5f377b96c8acc59b1f960 Mon Sep 17 00:00:00 2001 From: Scott Gerring Date: Mon, 17 Feb 2025 13:35:23 +0100 Subject: [PATCH 2/8] chore: Add more info about ADRs --- docs/adr/001_error_handling.md | 127 +++++++++++++++++---------------- docs/adr/README.md | 4 +- 2 files changed, 67 insertions(+), 64 deletions(-) diff --git a/docs/adr/001_error_handling.md b/docs/adr/001_error_handling.md index a692450a77..34f9ae59c5 100644 --- a/docs/adr/001_error_handling.md +++ b/docs/adr/001_error_handling.md @@ -1,19 +1,16 @@ # Error handling patterns in public API interfaces +## Date +17 Feb 2025 ## Context and Problem Statement -There is uncertainty around how to model errors in in the `opentelemetry-rust` public API interfaces - that is, APIs facing the consumers. At the time of writing this is an important issue to resolve as moving beginning to move the signals towards RC and eventually a stable release is an urgent priority. +There is uncertainty around how to model errors in the `opentelemetry-rust` public API interfaces - that is, APIs that are exposed to users of the project's published crates. This is for example the case with the exporter traits - [SpanExporter](https://github.com/open-telemetry/opentelemetry-rust/blob/eca1ce87084c39667061281e662d5edb9a002882/opentelemetry-sdk/src/trace/export.rs#L18), [LogExporter](https://github.com/open-telemetry/opentelemetry-rust/blob/eca1ce87084c39667061281e662d5edb9a002882/opentelemetry-sdk/src/logs/export.rs#L115), and [PushMetricExporter](https://github.com/open-telemetry/opentelemetry-rust/blob/eca1ce87084c39667061281e662d5edb9a002882/opentelemetry-sdk/src/metrics/exporter.rs#L11) which form part of the API surface of `opentelemetry-sdk`. -The situation is as follows; a concrete example is given, but the issue holds across various public traits, in particular the exporters: +We will focus on the exporter traits in this example, but the outcome should be applied to _all_ public traits and their fallible operations. -* A given public interface in `opentelemetry-sdk`,such as [trait LogExporter](https://github.com/open-telemetry/opentelemetry-rust/blob/3ec4c186ad22944b208ae7c3d38435e735a85c8e/opentelemetry-sdk/src/logs/export.rs#L115) -* ... exposes multiple discrete actions with logically disjoint error types (e.g. [export](https://github.com/open-telemetry/opentelemetry-rust/blob/3ec4c186ad22944b208ae7c3d38435e735a85c8e/opentelemetry-sdk/src/logs/export.rs#L133-L136) and [shutdown](https://github.com/open-telemetry/opentelemetry-rust/blob/3ec4c186ad22944b208ae7c3d38435e735a85c8e/opentelemetry-sdk/src/logs/export.rs#L139) - that is, the class of errors returned for each of these actions are foreseeably very different, as is the callers reaction to them -* ... is implemented by multiple concrete types such as `InMemoryLogExporter`, `OtlpLogExporter`, `StdOutLogExporter` that have different error requirements - for instance, an `OtlpLogExporter` will experience network failures, an `InMemoryLogExporter` will not -* Potentially has operations on the API that, either in the direct implementation, or in a derived utility that utilises the direct implementation, call _multiple_ API actions and therefore need to return an aggregated log type +There are various ways to handle errors on trait methods, including swallowing them and logging, panicing, returning a shared global error, or returning a method-specific error. We strive for consistency, and we want to be sure that we've put enough thought into what this looks like that we don't have to make breaking interface changes unecessarily in the future. -Today, we have a situation where a single error type is used per API-trait, and some methods simply swallow their errors. In the example above of `LogExporter`, `shutdown` swallows errors, and `export` returns the `LogError` type, a type that could conceptually be thought of as belonging to the entire trait, not a particular method. For the exporters, the [opentelemetry-specification](https://github.com/open-telemetry/opentelemetry-specification/blob/main/specification/logs/sdk.md#export) tells us that they need to indicate success or failure, with a distinction made between 'failed' and 'timed out'. - -There are also similar examples in the builders for providers and exports. +This was discussed extensively in #2571. ## Related Work @@ -27,71 +24,77 @@ There are also similar examples in the builders for providers and exports. Continue the status quo, returning a mix of either nothing or the trait-wide error type. This is inconsistent and limits the caller's ability to handle errors. **Option 2: Extend trait-wide error type to all methods on trait** -In this option we keep the existing error type, add it to the remaining methods on the trait, and extend the error type to include errors covering the new error conditions. This will mean that callers will have to know how and when to discard errors from a particular API call based on an understanding of which subset of errors that particular call can make. +In this option we have an error type per trait regardless of the potential error paths for the individual methods. Concretely if `fn (a)` can return `Failure1` and `Failure2`, and `fn (b)` can return `FailureC`, we have a unified error type that contains `Failure`, `Failure2`, and `Failure3`. + + This will mean that callers will have to know how and when to discard errors from a particular API call based on an understanding of which subset of errors that particular call can make. Conversely, it will reduce the number of error types in the code base. -**Option 3: Introduce an error-type per fallible operation, aggregate these into a single trait-wide error type** +**Option 3: Use shared errors where practical across signals, devolve into individual errors per operation if they need to diverge** + +Here we aim to consolidate error types where possible _without indicating a function may return more errors than it can actually return_. Conversely in **Option 2**, a caller of either of the example functions is forced to handle or discard all errors. In this case, we choose to sacrifice the single error and diverge into a separate error for each operation. + +Our preference for error types is thus: + +1. Consolidated error that covers all methods of a particular "trait type" (e.g., signal export) and method +1. Devolves into error type per method of a particular trait type (e.g., `SdkShutdownResult`, `SdkExportResult`) _if the error types need to diverge_ +1. May alternatively devolve into error type per signal (e.g., `SpanExporter`) if the _signals diverge_ + +This approach generalises across both **signals** and **trait methods**. For example, returning to our exporter traits, we have a trait that looks the same for each signal, with the same three methods. Upon closer inspection (#2600), the potential error set is the same both between the methods *and* between the signals; this means we can use a single shared error type across both axes: -For example, in the above we'd have something like: ```rust -pub trait LogExporter { - - fn export(...) -> Result<..., ExportError>; - fn shutdown(...) -> Result<..., ShutdownError> -} -// Concrete errors for an export operation -pub enum ExportError { - // The distinction between failure and timed out is part of the OTEL spec - // we need to meet. - - ExportFailed, - - ExportTimedOut(Duration), - - // Allow impls to box up errors that can't be logically mapped - // back to one of the APIs errors - #[error("Unknown error (should not occur): {source:?}")] - Unknown { - source: Box, - }, -} +#[derive(Error, Debug)] -// Aggregate error type for convenience -// **Note**: This will be added in response to need, not pre-emptively -#[derive(Debug, thiserror::Error)] -pub enum LogError { - #[error("Export error: {0}")] - InitError(#[from] ExportError), - - #[error("Shutdown error: {0}")] - ShutdownError(#[from] ShutdownError), -} +// Errors that can occur during SDK operations export(), force_flush() and shutdown(). +pub enum OTelSdkError { + + /// All error types in here may be returned by any of the operations + /// of the exporters, on any of their APIs. + /// If this were not the case, we would not be able to use a shared error. + #[error("Shutdown already invoked")] + AlreadyShutdown, + + // ... Other errors ... -// A downcast helper for callers that need to work with impl-specific -// unknown errors concretely -impl ExportError { - /// Attempt to downcast the inner `source` error to a specific type `T` - pub fn downcast_ref(&self) -> Option<&T> { - if let ExportError::Unknown { source } = self { - source.downcast_ref::() - } else { - None - } - } + /// The error message is intended for logging purposes only and should not + /// be used to make programmatic decisions. It is implementation-specific + /// and subject to change without notice. Consumers of this error should not + /// rely on its content beyond logging. + #[error("Operation failed: {0}")] + InternalFailure(String), } + +pub type OTelSdkResult = Result<(), OTelSdkError>; ``` -## Decision Outcome +... which the traits themselves use: + +```rust + +// +// The actionable errors returned by the exporter traits are effectively +// the same for all operations; we can use a single shared error. +// -Chosen option: **"Option 3: Introduce an error-type per fallible operation, aggregate these into a single trait-wide error type"** +use opentelemetry_sdk::error::OTelSdkResult; -### Consequences +pub trait LogExporter { + fn export(...) -> OtelSdkResult; + fn shutdown(...) -> OtelSdkResult; + fn force_flush(...) -> OTelSdkResult; +} + +// ... + +pub trait SpanExporter { + fn export(...) -> OtelSdkResult; + fn shutdown(...) -> OtelSdkResult; + fn force_flush(...) -> OTelSdkResult; +} + +``` -* Good, because callers can handle focussed errors with focussed remediation -* Good, because implementors of the `pub trait`s can box up custom errors in a fashion that follow's [canonical's error and panic discipline](https://canonical.github.io/rust-best-practices/error-and-panic-discipline.html) guide, by avoiding type erasure of impl-specific errors -* Good, because the per-trait error type (`LogError` for `LogExporter` above) provides consumers of the trait that hit multiple methods in a single method an error type they can use -* Bad, because there's more code than a single error type -* Bad, because a caller may need to use `downcast_ref` if they have a known trait impl and want to handle a `Unknown` error +Note above that we do not box anything into `InternalFailure`. Our rule here is that if the caller cannot reasonably be expected to handle a particular error variant, we will use a simplified interface that returns only a descriptive string. In the concrete example we are using with the exporters, we have a [strong signal in the opentelemetry-specification](https://github.com/open-telemetry/opentelemetry-specification/blob/main/specification/logs/sdk.md#export) that indicates concretely that the error types are not actionable by the caller. +If the caller may potentially recover from an error, we will follow [canonical's rust best practices](https://canonical.github.io/rust-best-practices/error-and-panic-discipline.html) and instead preserve the nested error. diff --git a/docs/adr/README.md b/docs/adr/README.md index 826cfae114..ae181011ec 100644 --- a/docs/adr/README.md +++ b/docs/adr/README.md @@ -1,5 +1,5 @@ # Architectural Decision Records -This directory contains architectural decision records made for the opentelemetry-rust project. These allow us to consolidate discussion, options, and outcomes, around key architectural decisions. +This directory contains architectural decision records made for the opentelemetry-rust project. These allow us to consolidate discussion, options, and outcomes, around key architectural decisions. You can read more about ADRs [here](https://adr.github.io/). -* [001 - Error Handling](001_error_handling.md) \ No newline at end of file +* [001 - Error Handling](001_error_handling.md) From 95090eaa1c35575c21f9be664a120c1446a2e066 Mon Sep 17 00:00:00 2001 From: Scott Gerring Date: Mon, 17 Feb 2025 13:47:29 +0100 Subject: [PATCH 3/8] Some more detail --- docs/adr/001_error_handling.md | 7 +++++++ 1 file changed, 7 insertions(+) diff --git a/docs/adr/001_error_handling.md b/docs/adr/001_error_handling.md index 34f9ae59c5..4115f2fc16 100644 --- a/docs/adr/001_error_handling.md +++ b/docs/adr/001_error_handling.md @@ -3,6 +3,10 @@ ## Date 17 Feb 2025 +## Accepted Option + +**Option 3** + ## Context and Problem Statement There is uncertainty around how to model errors in the `opentelemetry-rust` public API interfaces - that is, APIs that are exposed to users of the project's published crates. This is for example the case with the exporter traits - [SpanExporter](https://github.com/open-telemetry/opentelemetry-rust/blob/eca1ce87084c39667061281e662d5edb9a002882/opentelemetry-sdk/src/trace/export.rs#L18), [LogExporter](https://github.com/open-telemetry/opentelemetry-rust/blob/eca1ce87084c39667061281e662d5edb9a002882/opentelemetry-sdk/src/logs/export.rs#L115), and [PushMetricExporter](https://github.com/open-telemetry/opentelemetry-rust/blob/eca1ce87084c39667061281e662d5edb9a002882/opentelemetry-sdk/src/metrics/exporter.rs#L11) which form part of the API surface of `opentelemetry-sdk`. @@ -12,6 +16,7 @@ There are various ways to handle errors on trait methods, including swallowing t This was discussed extensively in #2571. + ## Related Work * #2564 @@ -95,6 +100,8 @@ pub trait SpanExporter { ``` +### When to box custom errors + Note above that we do not box anything into `InternalFailure`. Our rule here is that if the caller cannot reasonably be expected to handle a particular error variant, we will use a simplified interface that returns only a descriptive string. In the concrete example we are using with the exporters, we have a [strong signal in the opentelemetry-specification](https://github.com/open-telemetry/opentelemetry-specification/blob/main/specification/logs/sdk.md#export) that indicates concretely that the error types are not actionable by the caller. If the caller may potentially recover from an error, we will follow [canonical's rust best practices](https://canonical.github.io/rust-best-practices/error-and-panic-discipline.html) and instead preserve the nested error. From f2a9bf8cee3418207b11069c626371591e6607c8 Mon Sep 17 00:00:00 2001 From: Scott Gerring Date: Mon, 17 Feb 2025 13:53:49 +0100 Subject: [PATCH 4/8] fix links --- docs/adr/001_error_handling.md | 10 +++++----- 1 file changed, 5 insertions(+), 5 deletions(-) diff --git a/docs/adr/001_error_handling.md b/docs/adr/001_error_handling.md index 4115f2fc16..7ed36398c7 100644 --- a/docs/adr/001_error_handling.md +++ b/docs/adr/001_error_handling.md @@ -14,14 +14,14 @@ We will focus on the exporter traits in this example, but the outcome should be There are various ways to handle errors on trait methods, including swallowing them and logging, panicing, returning a shared global error, or returning a method-specific error. We strive for consistency, and we want to be sure that we've put enough thought into what this looks like that we don't have to make breaking interface changes unecessarily in the future. -This was discussed extensively in #2571. +This was discussed extensively in [#2571](https://github.com/open-telemetry/opentelemetry-rust/issues/2571). ## Related Work -* #2564 -* #2561 -* #2381 +* [#2564](https://github.com/open-telemetry/opentelemetry-rust/issues/2564) +* [#2561](https://github.com/open-telemetry/opentelemetry-rust/issues/2561) +* [#2381](https://github.com/open-telemetry/opentelemetry-rust/issues/2381) ## Considered Options @@ -45,7 +45,7 @@ Our preference for error types is thus: 1. Devolves into error type per method of a particular trait type (e.g., `SdkShutdownResult`, `SdkExportResult`) _if the error types need to diverge_ 1. May alternatively devolve into error type per signal (e.g., `SpanExporter`) if the _signals diverge_ -This approach generalises across both **signals** and **trait methods**. For example, returning to our exporter traits, we have a trait that looks the same for each signal, with the same three methods. Upon closer inspection (#2600), the potential error set is the same both between the methods *and* between the signals; this means we can use a single shared error type across both axes: +This approach generalises across both **signals** and **trait methods**. For example, returning to our exporter traits, we have a trait that looks the same for each signal, with the same three methods. Upon closer inspection ([#2600](https://github.com/open-telemetry/opentelemetry-rust/issues/2600)), the potential error set is the same both between the methods *and* between the signals; this means we can use a single shared error type across both axes: ```rust From f152a0a9ad8d0cc58893107655c7028989a18580 Mon Sep 17 00:00:00 2001 From: Scott Gerring Date: Thu, 27 Feb 2025 16:00:54 +0100 Subject: [PATCH 5/8] Update ADR format to be more prescriptive --- docs/adr/001_error_handling.md | 133 ++++++++++++++++----------------- docs/adr/README.md | 2 +- 2 files changed, 66 insertions(+), 69 deletions(-) diff --git a/docs/adr/001_error_handling.md b/docs/adr/001_error_handling.md index 7ed36398c7..37e2cc9743 100644 --- a/docs/adr/001_error_handling.md +++ b/docs/adr/001_error_handling.md @@ -1,107 +1,104 @@ # Error handling patterns in public API interfaces - ## Date -17 Feb 2025 - -## Accepted Option +27 Feb 2025 -**Option 3** +## Summary -## Context and Problem Statement -There is uncertainty around how to model errors in the `opentelemetry-rust` public API interfaces - that is, APIs that are exposed to users of the project's published crates. This is for example the case with the exporter traits - [SpanExporter](https://github.com/open-telemetry/opentelemetry-rust/blob/eca1ce87084c39667061281e662d5edb9a002882/opentelemetry-sdk/src/trace/export.rs#L18), [LogExporter](https://github.com/open-telemetry/opentelemetry-rust/blob/eca1ce87084c39667061281e662d5edb9a002882/opentelemetry-sdk/src/logs/export.rs#L115), and [PushMetricExporter](https://github.com/open-telemetry/opentelemetry-rust/blob/eca1ce87084c39667061281e662d5edb9a002882/opentelemetry-sdk/src/metrics/exporter.rs#L11) which form part of the API surface of `opentelemetry-sdk`. +This ADR describes the general pattern we will follow when modelling errors in public API interfaces - that is, APIs that are exposed to users of the project's published crates. . It summarises the discussion and final option from [#2571](https://github.com/open-telemetry/opentelemetry-rust/issues/2571); for more context check out that issue. We will focus on the exporter traits in this example, but the outcome should be applied to _all_ public traits and their fallible operations. -There are various ways to handle errors on trait methods, including swallowing them and logging, panicing, returning a shared global error, or returning a method-specific error. We strive for consistency, and we want to be sure that we've put enough thought into what this looks like that we don't have to make breaking interface changes unecessarily in the future. - -This was discussed extensively in [#2571](https://github.com/open-telemetry/opentelemetry-rust/issues/2571). - - -## Related Work - -* [#2564](https://github.com/open-telemetry/opentelemetry-rust/issues/2564) -* [#2561](https://github.com/open-telemetry/opentelemetry-rust/issues/2561) -* [#2381](https://github.com/open-telemetry/opentelemetry-rust/issues/2381) +These include [SpanExporter](https://github.com/open-telemetry/opentelemetry-rust/blob/eca1ce87084c39667061281e662d5edb9a002882/opentelemetry-sdk/src/trace/export.rs#L18), [LogExporter](https://github.com/open-telemetry/opentelemetry-rust/blob/eca1ce87084c39667061281e662d5edb9a002882/opentelemetry-sdk/src/logs/export.rs#L115), and [PushMetricExporter](https://github.com/open-telemetry/opentelemetry-rust/blob/eca1ce87084c39667061281e662d5edb9a002882/opentelemetry-sdk/src/metrics/exporter.rs#L11) which form part of the API surface of `opentelemetry-sdk`. -## Considered Options +There are various ways to handle errors on trait methods, including swallowing them and logging, panicing, returning a shared global error, or returning a method-specific error. We strive for consistency, and we want to be sure that we've put enough thought into what this looks like that we don't have to make breaking interface changes unecessarily in the future. -**Option 1: Continue as is** -Continue the status quo, returning a mix of either nothing or the trait-wide error type. This is inconsistent and limits the caller's ability to handle errors. +## Design Guidance -**Option 2: Extend trait-wide error type to all methods on trait** -In this option we have an error type per trait regardless of the potential error paths for the individual methods. Concretely if `fn (a)` can return `Failure1` and `Failure2`, and `fn (b)` can return `FailureC`, we have a unified error type that contains `Failure`, `Failure2`, and `Failure3`. +### 1. Panics are only appropriate at startup +Failures _after_ startup should not panic, instead returning errors to the caller where appropriate, _or_ logging an error if not appropriate. +Some of the opentelemetry SDK interfaces are dictated by the specification in way such that they may not return errors. - This will mean that callers will have to know how and when to discard errors from a particular API call based on an understanding of which subset of errors that particular call can make. +### 2. Consolidate error types within a trait where we can, let them diverge when we can't** -Conversely, it will reduce the number of error types in the code base. +We aim to consolidate error types where possible _without indicating a function may return more errors than it can actually return_. +Here's an example: -**Option 3: Use shared errors where practical across signals, devolve into individual errors per operation if they need to diverge** +```rust +enum ErrorOne { + TooBig, + TooSmall, +} -Here we aim to consolidate error types where possible _without indicating a function may return more errors than it can actually return_. Conversely in **Option 2**, a caller of either of the example functions is forced to handle or discard all errors. In this case, we choose to sacrifice the single error and diverge into a separate error for each operation. +enum ErrorTwo { + TooLong, + TooShort +} -Our preference for error types is thus: +trait MyTrait { + fn action_one() -> Result<(), ErrorOne>; -1. Consolidated error that covers all methods of a particular "trait type" (e.g., signal export) and method -1. Devolves into error type per method of a particular trait type (e.g., `SdkShutdownResult`, `SdkExportResult`) _if the error types need to diverge_ -1. May alternatively devolve into error type per signal (e.g., `SpanExporter`) if the _signals diverge_ + // Action two and three share the same error type. + // We do not introduce a common error MyTraitError for all operations, as this would + // force all methods on the trait to indicate they return errors they do not return, + // complicating things for the caller. + fn action_two() -> Result<(), ErrorTwo>; + fn action_three() -> Result<(), ErrorTwo>; +} +``` +## 3. Consolidate error types between signals where we can, let them diverge where we can't -This approach generalises across both **signals** and **trait methods**. For example, returning to our exporter traits, we have a trait that looks the same for each signal, with the same three methods. Upon closer inspection ([#2600](https://github.com/open-telemetry/opentelemetry-rust/issues/2600)), the potential error set is the same both between the methods *and* between the signals; this means we can use a single shared error type across both axes: +Consider the `Exporter`s mentioned earlier. Each of them has the same failure indicators - as dicated by the OpenTelemetry spec - and we will +share the error types accordingly: ```rust -#[derive(Error, Debug)] +/// opentelemetry-sdk::error -// Errors that can occur during SDK operations export(), force_flush() and shutdown(). +#[derive(Error, Debug)] pub enum OTelSdkError { - - /// All error types in here may be returned by any of the operations - /// of the exporters, on any of their APIs. - /// If this were not the case, we would not be able to use a shared error. #[error("Shutdown already invoked")] AlreadyShutdown, - - // ... Other errors ... - - /// The error message is intended for logging purposes only and should not - /// be used to make programmatic decisions. It is implementation-specific - /// and subject to change without notice. Consumers of this error should not - /// rely on its content beyond logging. + #[error("Operation failed: {0}")] InternalFailure(String), + + /** ... additional errors ... **/ } pub type OTelSdkResult = Result<(), OTelSdkError>; -``` -... which the traits themselves use: +/// signal-specific exporter traits all share the same +/// result types for the exporter operations. -```rust +// pub trait LogExporter { +// pub trait SpanExporter { +pub trait PushMetricExporter { -// -// The actionable errors returned by the exporter traits are effectively -// the same for all operations; we can use a single shared error. -// + fn export(&self, /* ... */) -> OtelSdkResult; + fn force_flush(&self, /* ... */ ) -> OTelSdkResult; + fn shutdown(&self, /* ... */ ) -> OTelSdkResult; +``` -use opentelemetry_sdk::error::OTelSdkResult; +If this were _not_ the case - if we needed to mark an extra error for instance for `LogExporter` that the caller could reasonably handle - +we would let that error traits diverge at that point. -pub trait LogExporter { - fn export(...) -> OtelSdkResult; - fn shutdown(...) -> OtelSdkResult; - fn force_flush(...) -> OTelSdkResult; -} +### 4. Box custom errors where a savvy caller may be able to handle them, stringify them if not -// ... +Note above that we do not box anything into `InternalFailure`. Our rule here is that if the caller cannot reasonably be expected to handle a particular error variant, we will use a simplified interface that returns only a descriptive string. In the concrete example we are using with the exporters, we have a [strong signal in the opentelemetry-specification](https://github.com/open-telemetry/opentelemetry-specification/blob/main/specification/logs/sdk.md#export) that indicates that the error types _are not actionable_ by the caller. -pub trait SpanExporter { - fn export(...) -> OtelSdkResult; - fn shutdown(...) -> OtelSdkResult; - fn force_flush(...) -> OTelSdkResult; -} +If the caller may potentially recover from an error, we will follow the generally-accepted best practice (e.g., see [canonical's guide](https://canonical.github.io/rust-best-practices/error-and-panic-discipline.html) and instead preserve the nested error: +```rust +#[derive(Debug, Error)] +pub enum MyError { + #[error("Error one occurred")] + ErrorOne, + + #[error("Operation failed: {source}")] + OtherError { + #[from] + source: Box, + }, +} ``` -### When to box custom errors - -Note above that we do not box anything into `InternalFailure`. Our rule here is that if the caller cannot reasonably be expected to handle a particular error variant, we will use a simplified interface that returns only a descriptive string. In the concrete example we are using with the exporters, we have a [strong signal in the opentelemetry-specification](https://github.com/open-telemetry/opentelemetry-specification/blob/main/specification/logs/sdk.md#export) that indicates concretely that the error types are not actionable by the caller. - -If the caller may potentially recover from an error, we will follow [canonical's rust best practices](https://canonical.github.io/rust-best-practices/error-and-panic-discipline.html) and instead preserve the nested error. diff --git a/docs/adr/README.md b/docs/adr/README.md index ae181011ec..b2bd543db1 100644 --- a/docs/adr/README.md +++ b/docs/adr/README.md @@ -1,5 +1,5 @@ # Architectural Decision Records -This directory contains architectural decision records made for the opentelemetry-rust project. These allow us to consolidate discussion, options, and outcomes, around key architectural decisions. You can read more about ADRs [here](https://adr.github.io/). +This directory contains architectural decision records made for the opentelemetry-rust project. These allow us to consolidate discussion, options, and outcomes, around key architectural decisions. * [001 - Error Handling](001_error_handling.md) From 9c46625e97b1087040e02ba8c50f5629ead8c6e7 Mon Sep 17 00:00:00 2001 From: Scott Gerring Date: Fri, 28 Feb 2025 08:15:57 +0100 Subject: [PATCH 6/8] changed startup wording --- docs/adr/001_error_handling.md | 5 ++--- 1 file changed, 2 insertions(+), 3 deletions(-) diff --git a/docs/adr/001_error_handling.md b/docs/adr/001_error_handling.md index 37e2cc9743..342c609eea 100644 --- a/docs/adr/001_error_handling.md +++ b/docs/adr/001_error_handling.md @@ -14,8 +14,8 @@ There are various ways to handle errors on trait methods, including swallowing t ## Design Guidance -### 1. Panics are only appropriate at startup -Failures _after_ startup should not panic, instead returning errors to the caller where appropriate, _or_ logging an error if not appropriate. +### 1. No panics from SDK APIs +Failures during regular operation should not panic, instead returning errors to the caller where appropriate, _or_ logging an error if not appropriate. Some of the opentelemetry SDK interfaces are dictated by the specification in way such that they may not return errors. ### 2. Consolidate error types within a trait where we can, let them diverge when we can't** @@ -73,7 +73,6 @@ pub type OTelSdkResult = Result<(), OTelSdkError>; // pub trait LogExporter { // pub trait SpanExporter { pub trait PushMetricExporter { - fn export(&self, /* ... */) -> OtelSdkResult; fn force_flush(&self, /* ... */ ) -> OTelSdkResult; fn shutdown(&self, /* ... */ ) -> OTelSdkResult; From 749454d657fc4fe325882e1c58ea40d259d02c8c Mon Sep 17 00:00:00 2001 From: Scott Gerring Date: Thu, 6 Mar 2025 13:32:58 +0100 Subject: [PATCH 7/8] Apply suggestions from code review Co-authored-by: Zhongyang Wu Co-authored-by: Cijo Thomas --- docs/adr/001_error_handling.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/docs/adr/001_error_handling.md b/docs/adr/001_error_handling.md index 342c609eea..b33b80e0fd 100644 --- a/docs/adr/001_error_handling.md +++ b/docs/adr/001_error_handling.md @@ -4,7 +4,7 @@ ## Summary -This ADR describes the general pattern we will follow when modelling errors in public API interfaces - that is, APIs that are exposed to users of the project's published crates. . It summarises the discussion and final option from [#2571](https://github.com/open-telemetry/opentelemetry-rust/issues/2571); for more context check out that issue. +This ADR describes the general pattern we will follow when modelling errors in public API interfaces - that is, APIs that are exposed to users of the project's published crates. It summarises the discussion and final option from [#2571](https://github.com/open-telemetry/opentelemetry-rust/issues/2571); for more context check out that issue. We will focus on the exporter traits in this example, but the outcome should be applied to _all_ public traits and their fallible operations. @@ -83,7 +83,7 @@ we would let that error traits diverge at that point. ### 4. Box custom errors where a savvy caller may be able to handle them, stringify them if not -Note above that we do not box anything into `InternalFailure`. Our rule here is that if the caller cannot reasonably be expected to handle a particular error variant, we will use a simplified interface that returns only a descriptive string. In the concrete example we are using with the exporters, we have a [strong signal in the opentelemetry-specification](https://github.com/open-telemetry/opentelemetry-specification/blob/main/specification/logs/sdk.md#export) that indicates that the error types _are not actionable_ by the caller. +Note above that we do not box any `Error` into `InternalFailure`. Our rule here is that if the caller cannot reasonably be expected to handle a particular error variant, we will use a simplified interface that returns only a descriptive string. In the concrete example we are using with the exporters, we have a [strong signal in the opentelemetry-specification](https://github.com/open-telemetry/opentelemetry-specification/blob/main/specification/logs/sdk.md#export) that indicates that the error types _are not actionable_ by the caller. If the caller may potentially recover from an error, we will follow the generally-accepted best practice (e.g., see [canonical's guide](https://canonical.github.io/rust-best-practices/error-and-panic-discipline.html) and instead preserve the nested error: From 73df360d7b65ab1e92dd2150fa36d5e560fc756e Mon Sep 17 00:00:00 2001 From: Scott Gerring Date: Thu, 6 Mar 2025 13:49:29 +0100 Subject: [PATCH 8/8] address review feedback --- docs/adr/001_error_handling.md | 72 +++++++++++++++++++++++++++++++++- 1 file changed, 71 insertions(+), 1 deletion(-) diff --git a/docs/adr/001_error_handling.md b/docs/adr/001_error_handling.md index b33b80e0fd..58aa1246ad 100644 --- a/docs/adr/001_error_handling.md +++ b/docs/adr/001_error_handling.md @@ -21,7 +21,28 @@ Some of the opentelemetry SDK interfaces are dictated by the specification in wa ### 2. Consolidate error types within a trait where we can, let them diverge when we can't** We aim to consolidate error types where possible _without indicating a function may return more errors than it can actually return_. -Here's an example: + +**Don't do this** - each function's signature indicates that it returns errors it will _never_ return, forcing the caller to write handlers for dead paths: +```rust +enum MegaError { + TooBig, + TooSmall, + TooLong, + TooShort +} + +trait MyTrait { + + // Will only ever return TooBig,TooSmall errors + fn action_one() -> Result<(), MegaError>; + + // These will only ever return TooLong,TooShort errors + fn action_two() -> Result<(), MegaError>; + fn action_three() -> Result<(), MegaError>; +} +``` + +**Instead, do this** - each function's signature indicates only the errors it can return, providing an accurate contract to the caller: ```rust enum ErrorOne { @@ -45,11 +66,40 @@ trait MyTrait { fn action_three() -> Result<(), ErrorTwo>; } ``` + ## 3. Consolidate error types between signals where we can, let them diverge where we can't Consider the `Exporter`s mentioned earlier. Each of them has the same failure indicators - as dicated by the OpenTelemetry spec - and we will share the error types accordingly: +**Don't do this** - each signal has its own error type, despite having exactly the same failure cases: + +```rust +#[derive(Error, Debug)] +pub enum OtelTraceError { + #[error("Shutdown already invoked")] + AlreadyShutdown, + + #[error("Operation failed: {0}")] + InternalFailure(String), + + /** ... additional errors ... **/ +} + +#[derive(Error, Debug)] +pub enum OtelLogError { + #[error("Shutdown already invoked")] + AlreadyShutdown, + + #[error("Operation failed: {0}")] + InternalFailure(String), + + /** ... additional errors ... **/ +} +``` + +**Instead, do this** - error types are consolidated between signals where this can be done appropriately: + ```rust /// opentelemetry-sdk::error @@ -87,6 +137,20 @@ Note above that we do not box any `Error` into `InternalFailure`. Our rule here If the caller may potentially recover from an error, we will follow the generally-accepted best practice (e.g., see [canonical's guide](https://canonical.github.io/rust-best-practices/error-and-panic-discipline.html) and instead preserve the nested error: +**Don't do this if the OtherError is potentially recoverable by a savvy caller**: +```rust + +#[derive(Debug, Error)] +pub enum MyError { + #[error("Error one occurred")] + ErrorOne, + + #[error("Operation failed: {0}")] + OtherError(String), +``` + +**Instead, do this**, allowing the caller to match on the nested error: + ```rust #[derive(Debug, Error)] pub enum MyError { @@ -101,3 +165,9 @@ pub enum MyError { } ``` +Note that at the time of writing, there is no instance we have identified within the project that has required this. + +### 5. Use thiserror by default +We will use [thiserror](https://docs.rs/thiserror/latest/thiserror/) by default to implement Rust's [error trait](https://doc.rust-lang.org/core/error/trait.Error.html). +This keeps our code clean, and as it does not appear in our interface, we can choose to replace any particular usage with a hand-rolled implementation should we need to. +