Propagate Step Function Trace Context through Managed Services #573

avedmala · 2025-03-11T16:58:24Z

What does this PR do?

Allows us to extract Step Function trace context in the following cases:

SFN -> EventBridge -> Lambda
SFN -> EventBridge -> SQS -> Lambda
SFN -> SQS -> Lambda
SFN -> SNS -> Lambda
SFN -> SNS -> SQS -> Lambda

Motivation

This is a customer feature request for the following use case:

- A Step Function has an EventBridge PutEvents task that emits an event with a task token, with “Wait for callback” enabled on the task.
- A matching EventBridge rule triggers a Lambda that handles the event and resumes the Step Function with the task token.

Supporting the other cases is not much more work, so this PR adds them as well.

Testing Guidelines

SFN -> EventBridge -> Lambda trace
SFN -> EventBridge -> SQS -> Lambda trace

SFN -> SQS -> Lambda trace

SFN -> SNS -> Lambda trace

SFN -> SNS -> SQS -> Lambda trace

Additional Notes

The instrumentation involves injecting the _datadog: {...} trace context into these managed services from the Step Function definition. Instructions will be added to our public docs.

Types of Changes

Bug fix
New feature
Breaking change
Misc (docs, refactoring, dependency upgrade, etc.)

Check all that apply

This PR's description is comprehensive
This PR contains breaking changes that are documented in the description
This PR introduces new APIs or parameters that are documented and unlikely to change in the foreseeable future
This PR impacts documentation, and it has been updated (or a ticket has been logged)
This PR's changes are covered by the automated tests
This PR collects user input/sensitive content into Datadog
This PR passes the integration tests (ask a Datadog member to run the tests)

datadog_lambda/trigger.py

avedmala · 2025-03-11T18:02:56Z

datadog_lambda/wrapper.py

@@ -279,8 +278,6 @@ def _before(self, event, context):
            self.response = None
            set_cold_start(init_timestamp_ns)
            submit_invocations_metric(context)
-            if is_legacy_lambda_step_function(event):
-                event = event["Payload"]


Moved this unwrapping to happen inside of tracing.extract_context_from_step_functions()

avedmala · 2025-03-11T18:23:04Z

datadog_lambda/tracing.py

@@ -1320,6 +1314,10 @@ def create_inferred_span_from_eventbridge_event(event, context):
    if span:
        span.set_tags(tags)
    span.start = dt.replace(tzinfo=timezone.utc).timestamp()
+
+    # Since inferred span will later parent Lambda, preserve Lambda's current parent
+    span.parent_id = dd_trace_context.span_id


This is important because we have the following code in tracing.create_function_execution_span():

if parent_span: span.parent_id = parent_span.span_id

Here parent_span is the generated inferred span, so the parent_id of the Lambda's root span is set to the inferred span's span_id.

If there is an upstream Step Function and we saved its trace context in dd_trace_context, we want to preserve that parenting relationship and not let the inferred span erase it.

This line solves the issue by making the inferred span a child of the upstream service.
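To make the parenting chain concrete, here is a minimal sketch. The Span dataclass, the link_spans helper, the span names, and the span IDs are all hypothetical stand-ins for illustration; this is not ddtrace's Span class or the layer's actual code.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Span:
    """Hypothetical stand-in for a tracer span (not ddtrace's Span)."""

    name: str
    span_id: int
    parent_id: Optional[int] = None


def link_spans(upstream_span_id, preserve_upstream):
    """Build an inferred span and a Lambda span, then parent them."""
    inferred = Span("aws.eventbridge", span_id=2)
    if preserve_upstream and upstream_span_id is not None:
        # Mirrors the added line: the inferred span becomes a child of the
        # upstream Step Function span saved in the extracted trace context.
        inferred.parent_id = upstream_span_id

    function_span = Span("aws.lambda", span_id=3)
    # Mirrors create_function_execution_span(): the Lambda span is
    # re-parented to the inferred span whenever one exists.
    function_span.parent_id = inferred.span_id
    return inferred, function_span


# With the fix, the chain is upstream (1) -> inferred (2) -> Lambda (3).
inferred, function_span = link_spans(upstream_span_id=1, preserve_upstream=True)
```

Without the extra assignment, the inferred span would have no parent, and the link back to the Step Function would be lost once the Lambda span is re-parented.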

avedmala · 2025-03-11T21:13:11Z

datadog_lambda/tracing.py

+    # Use more granular timestamp from upstream Step Function if possible
+    if is_step_function_event(event.get("detail")):
+        timestamp = event.get("detail").get("_datadog").get("State").get("EnteredTime")
+        dt_format = "%Y-%m-%dT%H:%M:%S.%fZ"


Comparison without (1) and with (2) this logic:

By default, the timestamp provided by EventBridge only has one-second resolution, which can make the inferred span appear to start before the Step Function (1). Using the state's entered time instead makes the trace less confusing to customers (2).
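A small sketch of the resolution difference, using made-up timestamps. The parse_utc helper and both sample values are assumptions for illustration; only the "%Y-%m-%dT%H:%M:%S.%fZ" format string comes from the diff above.

```python
from datetime import datetime, timezone


def parse_utc(timestamp, fmt):
    """Parse a UTC timestamp string into epoch seconds (hypothetical helper)."""
    return datetime.strptime(timestamp, fmt).replace(tzinfo=timezone.utc).timestamp()


# Illustrative values: EventBridge's "time" is truncated to the second,
# while the Step Function state's EnteredTime carries milliseconds.
eventbridge_time = "2025-03-11T16:58:24Z"
entered_time = "2025-03-11T16:58:24.731Z"

eb_start = parse_utc(eventbridge_time, "%Y-%m-%dT%H:%M:%SZ")
sfn_start = parse_utc(entered_time, "%Y-%m-%dT%H:%M:%S.%fZ")

# The truncated timestamp places the span start earlier than the state
# actually began, which is what produces the confusing ordering in (1).
offset_seconds = sfn_start - eb_start
```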

Great job!

P.S. I was confused about what (1) and (2) meant when I first glanced at the PR. Originally, I thought they referred to some logic before this section of code that was hidden on GitHub.

Added numbers next to the pictures to make it clearer for other reviewers.

kimi-p

Only minor nits. Really appreciate these great in-depth comments! Thank you!

datadog_lambda/trigger.py

kimi-p · 2025-03-13T00:51:36Z

tests/test_trigger.py

@@ -543,3 +544,68 @@ def test_extract_http_status_code_tag_from_response_object(self):
        response.status_code = 403
        status_code = extract_http_status_code_tag(trigger_tags, response)
        self.assertEqual(status_code, "403")
+
+
+class IsStepFunctionEvent(unittest.TestCase):


Nice! Thanks for putting the tests here; they also make the code easier to understand in the future.

kimi-p · 2025-03-13T00:58:43Z

datadog_lambda/trigger.py

+    The whole event can be wrapped in "Payload" in Legacy Lambda cases. There may also be a
+    "_datadog" for JSONata style context propagation.
+
+    The actual event must contain "Execution", "StateMachine", and "State" fields.


Really like these comments. As someone who hasn't worked on Step Functions for a while, I find they help me recall the historical context. They'll help future maintenance of the code as well.
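For readers without the full diff, the check described in that docstring could look roughly like the sketch below. The function name looks_like_step_function_event is hypothetical and this is not the merged implementation of is_step_function_event; it only follows the three rules stated in the docstring.

```python
def looks_like_step_function_event(event):
    """Hypothetical sketch of the check described in the docstring above."""
    if not isinstance(event, dict):
        return False

    # Legacy Lambda invocations wrap the whole event in "Payload".
    event = event.get("Payload", event)
    if not isinstance(event, dict):
        return False

    # JSONata-style context propagation nests the context under "_datadog".
    event = event.get("_datadog", event)
    if not isinstance(event, dict):
        return False

    # The actual event must contain these three fields.
    return "Execution" in event and "StateMachine" in event and "State" in event
```

The isinstance guards matter because callers pass in values such as event.get("detail"), which can be None or a non-dict.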

kimi-p · 2025-03-13T01:04:27Z

datadog_lambda/tracing.py

+    # Use more granular timestamp from upstream Step Function if possible
+    if is_step_function_event(event.get("detail")):
+        timestamp = event.get("detail").get("_datadog").get("State").get("EnteredTime")
+        dt_format = "%Y-%m-%dT%H:%M:%S.%fZ"


Great job!

P.S. I was confused about what (1) and (2) meant when I first glanced at the PR. Originally, I thought they referred to some logic before this section of code that was hidden on GitHub.

datadog_lambda/tracing.py

Co-authored-by: kimi <[email protected]>

tests/test_tracing.py

purple4reina · 2025-03-13T17:05:32Z

datadog_lambda/trigger.py

@@ -369,3 +367,28 @@ def extract_http_status_code_tag(trigger_tags, response):
        status_code = response.status_code

    return str(status_code)
+
+
+def is_step_function_event(event):


Is there a way we can memoize this function? It looks like it can potentially be called several times in the course of a single invocation.

Hmm, or it looks like the function can be called multiple times per invocation, but with different "events" each time? If that's true, then we can probably leave it.

That's a great idea!

Correct me if I'm wrong but does the layer only handle one event per invocation? Or if it's a busy Lambda does it stay alive and potentially handle hundreds of events?

Just wondering to get an idea of how large to make the cache. I guess it can be pretty small anyway since each event is new and we don't repeat

Each runtime instance will only ever handle one event at a time. It never handles two events concurrently.

Ah, just realized we can't memoize it directly because event is a dict, and dicts are unhashable.

We could serialize the dict and use that as the cache key, but I think that would be much slower.

datadog_lambda/trigger.py

purple4reina · 2025-03-13T17:16:25Z

datadog_lambda/tracing.py

    """
    try:
        detail = event.get("detail")
        dd_context = detail.get("_datadog")
        if not dd_context:
            return extract_context_from_lambda_context(lambda_context)
+
+        if is_step_function_event(dd_context):
+            return extract_context_from_step_functions(dd_context, lambda_context)


This one isn't wrapped in a try/except, but the two above are. Why is that?

Ah, I meant to wrap this in try/except too. Let me explain why I'm doing it this way:

I wanted to pass in lambda_context=None so that if the extractor fails, it won't fall back on the lambda_context. Instead it continues with the normal code path and calls

return propagator.extract(dd_context)
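A rough sketch of that control flow, under stated assumptions: extract_step_function_ids and extract_with_fallback are hypothetical names, the returned dict is a placeholder for a real trace context, and generic_extract stands in for propagator.extract.

```python
import logging

logger = logging.getLogger(__name__)


def extract_step_function_ids(dd_context):
    """Hypothetical extractor: raises KeyError/TypeError on malformed input."""
    return {
        "execution_id": dd_context["Execution"]["Id"],
        "state_name": dd_context["State"]["Name"],
    }


def extract_with_fallback(dd_context, generic_extract):
    """Try the Step Functions path first, then fall through on failure."""
    if "Execution" in dd_context:
        try:
            return extract_step_function_ids(dd_context)
        except Exception:
            # Swallow the error so a malformed Step Functions context
            # does not prevent the generic extraction below.
            logger.debug("Failed to extract Step Functions context.")

    # Normal code path, standing in for propagator.extract(dd_context).
    return generic_extract(dd_context)
```

The design point is that a failed Step Functions extraction degrades to the pre-existing behavior rather than to the Lambda context.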

datadog_lambda/tracing.py

purple4reina

Left some questions about performance and exception handling.

nine5two7 · 2025-03-17T19:15:09Z

datadog_lambda/tracing.py

+            logger.debug(
+                "Failed to extract Step Functions context from EventBridge to SQS event."
+            )
+
    return propagator.extract(dd_context)


 def extract_context_from_eventbridge_event(event, lambda_context):


I am curious about how the concatenation of two queues (e.g., SFN → EventBridge → SQS → Lambda) is handled. Is it achieved by extracting two different contexts in the Python tracer? Does this mean that it also supports SFN → EventBridge → SQS → SNS → Lambda?

SFN → EventBridge → SQS → Lambda is handled the following way:

- The final event that enters the Lambda is an SQS event.
- Our context extractor for SQS events checks if there's an EventBridge event within and uses that if it's valid.

SFN → SNS → SQS → Lambda is handled very similarly, with another explicit check in the SQS extractor that looks for nested SNS events.

We don't handle SFN → SQS → SNS → Lambda AFAIK, and we wouldn't be able to handle SFN → EventBridge → SQS → SNS → Lambda out of the box either.

But this is only because it's not explicitly handled. The current Python layer implementation is messy because it relies on explicit handling. I think a perfect solution would be one where it's all handled recursively, so customers can nest an arbitrary number of supported services without explicit handling.

I think the AWS team would like to do something like this in bottlecap.
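To illustrate what "handled recursively" could mean, here is a minimal sketch. It is not what the layer or bottlecap does: find_step_function_context is a hypothetical name, and the sketch assumes the context always travels inside JSON bodies under "_datadog" (it ignores message attributes and other carriers).

```python
import json

REQUIRED_KEYS = ("Execution", "StateMachine", "State")
# Keys that may wrap another payload: EventBridge "detail", SQS "body",
# SNS "Sns"/"Message", and the "_datadog" carrier (assumed placement).
WRAPPER_KEYS = ("_datadog", "detail", "body", "Sns", "Message")


def _maybe_json(value):
    """Decode JSON strings; return other values unchanged, None if invalid."""
    if isinstance(value, str):
        try:
            return json.loads(value)
        except ValueError:
            return None
    return value


def find_step_function_context(payload, max_depth=8):
    """Recursively unwrap nested service envelopes looking for SFN context."""
    payload = _maybe_json(payload)
    if not isinstance(payload, dict) or max_depth <= 0:
        return None
    if all(key in payload for key in REQUIRED_KEYS):
        return payload

    candidates = []
    records = payload.get("Records")
    if isinstance(records, list) and records:
        candidates.append(records[0])  # first SQS/SNS record only
    candidates.extend(payload[key] for key in WRAPPER_KEYS if key in payload)

    for candidate in candidates:
        found = find_step_function_context(candidate, max_depth - 1)
        if found is not None:
            return found
    return None
```

With this shape, SFN → EventBridge → SQS and SFN → SNS → SQS go through the same code path, and deeper nesting needs no new branches, only a large enough max_depth.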

@avedmala Thanks for the explanation. Very informative. I am guessing that a recursive solution should not be that complicated? @purple4reina @joeyzhao2018

Regarding the "recursive solution", is it written down in any RFC? It sounds interesting and might be able to solve some other problems.

nhulston · 2025-03-19T12:32:38Z

tests/test_tracing.py

+    @with_trace_propagation_style("datadog")
+    def test_step_function_trace_data_sns(self):
+        """Test step function trace data extraction through SNS"""
+        sns_event = {
+            "Records": [
+                {
+                    "EventSource": "aws:sns",
+                    "EventVersion": "1.0",


Wow, this file is getting really long! Not your job, but a future task for the serverless AWS team could be to pull out these events as JSON files and load the JSON files in the tests, instead of defining the objects directly here.

nhulston

Looks good to me! Super thorough, and nice job on manually testing and sharing screenshots for the trace propagation cases.

avedmala added 2 commits March 11, 2025 12:57

support step function context in eventbridge extraction

a38c77b

fix circular import

0a51475

datadog-datadog-prod-us1 bot reviewed Mar 11, 2025

datadog_lambda/trigger.py (resolved)

DataDog deleted a comment from datadog-datadog-prod-us1 bot Mar 11, 2025

avedmala added 2 commits March 11, 2025 13:46

fix lint

5700d22

fix parent_id value in unit test

f524509

avedmala commented Mar 11, 2025


use provided timestamp from step function

55ff075

avedmala commented Mar 11, 2025


avedmala marked this pull request as ready for review March 12, 2025 13:54

avedmala requested review from a team as code owners March 12, 2025 13:54

extract context in eventbridge sqs case

d7daf0d

kimi-p approved these changes Mar 13, 2025


Update datadog_lambda/trigger.py

e490610

Co-authored-by: kimi <[email protected]>

avedmala changed the title ~~Trace Context Propagation for Step Function EventBridge Callback~~ Propagate Step Function Trace Context through Managed Services Mar 13, 2025

avedmala added 2 commits March 13, 2025 11:53

add sqs/sns support

e31c8f8

add more test cases

dc4a283

datadog-datadog-prod-us1 bot reviewed Mar 13, 2025

tests/test_tracing.py (resolved)

fix expected _dd.p.tid vals

f59de19

purple4reina reviewed Mar 13, 2025

datadog_lambda/trigger.py (outdated)

purple4reina reviewed Mar 13, 2025

datadog_lambda/trigger.py (outdated)

purple4reina reviewed Mar 13, 2025

datadog_lambda/tracing.py (outdated)

purple4reina reviewed Mar 13, 2025

avedmala added 3 commits March 13, 2025 15:50

add lru cache and remove usage of all()

6222016

exception handling for timestamp parsing

45ba945

remove memo

dc43c9e

avedmala added 2 commits March 17, 2025 09:36

refactor tests to use common helper

eec4743

removed event_type assertion

1b350b9

avedmala requested a review from purple4reina March 17, 2025 17:33

nine5two7 reviewed Mar 17, 2025

purple4reina approved these changes Mar 18, 2025

nhulston reviewed Mar 19, 2025

nhulston approved these changes Mar 19, 2025

avedmala merged commit 96a6abd into main Mar 19, 2025
60 checks passed

avedmala deleted the avedmala/sfn-eventbridge-tc branch March 19, 2025 13:50


