Propagate Step Function Trace Context through Managed Services #573
@@ -279,8 +278,6 @@ def _before(self, event, context):
         self.response = None
         set_cold_start(init_timestamp_ns)
         submit_invocations_metric(context)
-        if is_legacy_lambda_step_function(event):
-            event = event["Payload"]
Moved this unwrapping to happen inside of tracing.extract_context_from_step_functions()
datadog_lambda/tracing.py
Outdated
@@ -1320,6 +1314,10 @@ def create_inferred_span_from_eventbridge_event(event, context):
     if span:
         span.set_tags(tags)
         span.start = dt.replace(tzinfo=timezone.utc).timestamp()

+        # Since inferred span will later parent Lambda, preserve Lambda's current parent
+        span.parent_id = dd_trace_context.span_id
This is important because we have the following code in `tracing.create_function_execution_span()`:

    if parent_span:
        span.parent_id = parent_span.span_id

where `parent_span` is the generated inferred span, so the Lambda root span's `parent_id` will be set to the inferred span's `span_id`.

If there is an upstream Step Function and we saved its trace context in `dd_trace_context`, we want to preserve that parenting relationship and not let the inferred span completely erase it. This line solves the issue by making the inferred span a child of the upstream service.
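A toy sketch of the parenting chain this produces (all span ids are made up; `SpanStub` is a stand-in, not the tracer's span class):

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class SpanStub:
    span_id: int
    parent_id: Optional[int] = None

upstream = SpanStub(span_id=111)    # trace context saved in dd_trace_context
inferred = SpanStub(span_id=222)    # inferred EventBridge span
lambda_span = SpanStub(span_id=333) # Lambda's function execution span

# create_function_execution_span() re-parents the Lambda under the inferred span:
lambda_span.parent_id = inferred.span_id

# The fix: make the inferred span a child of the upstream Step Function,
# preserving the chain upstream -> inferred -> lambda.
inferred.parent_id = upstream.span_id

assert (lambda_span.parent_id, inferred.parent_id) == (222, 111)
```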
datadog_lambda/tracing.py
Outdated
        # Use more granular timestamp from upstream Step Function if possible
        if is_step_function_event(event.get("detail")):
            timestamp = event.get("detail").get("_datadog").get("State").get("EnteredTime")
            dt_format = "%Y-%m-%dT%H:%M:%S.%fZ"
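For reference, parsing a Step Function state's `EnteredTime` with that format string looks like this (the timestamp value below is illustrative):

```python
from datetime import datetime, timezone

entered_time = "2024-05-01T12:34:56.789Z"
dt_format = "%Y-%m-%dT%H:%M:%S.%fZ"

# strptime gives a naive datetime; treat it as UTC before taking an epoch
# timestamp, mirroring dt.replace(tzinfo=timezone.utc).timestamp() in the diff.
dt = datetime.strptime(entered_time, dt_format)
start = dt.replace(tzinfo=timezone.utc).timestamp()
```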
Great job!

ps. I was confused about what (1) and (2) meant when I first glanced at the PR. Originally, I thought it was some logic before this section of code that got hidden on GitHub.
Added numbers next to the pictures to make it clearer for other reviewers
Only minor nits. Really appreciate these great in-depth comments! Thank you!
@@ -543,3 +544,68 @@ def test_extract_http_status_code_tag_from_response_object(self):
         response.status_code = 403
         status_code = extract_http_status_code_tag(trigger_tags, response)
         self.assertEqual(status_code, "403")


+class IsStepFunctionEvent(unittest.TestCase):
Nice! Thanks for putting the tests here which also make the code easier to understand for the future.
    The whole event can be wrapped in "Payload" in Legacy Lambda cases. There may also be a
    "_datadog" for JSONata style context propagation.

    The actual event must contain "Execution", "StateMachine", and "State" fields.
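A sketch of a detector matching that docstring (an assumption for illustration, not the shipped implementation):

```python
def is_step_function_event_sketch(event):
    # Unwrap the optional "Payload" (Legacy Lambda) and "_datadog" (JSONata
    # propagation) layers, then require the three Step Function fields.
    if not isinstance(event, dict):
        return False
    event = event.get("Payload", event)
    if not isinstance(event, dict):
        return False
    event = event.get("_datadog", event)
    if not isinstance(event, dict):
        return False
    return all(k in event for k in ("Execution", "StateMachine", "State"))
```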
Really like these comments. For someone who hasn't worked on step functions for a while, these comments help me recollect the historical context. It'll help future maintenance of the code as well.
Co-authored-by: kimi <[email protected]>
@@ -369,3 +367,28 @@ def extract_http_status_code_tag(trigger_tags, response):
         status_code = response.status_code

     return str(status_code)


+def is_step_function_event(event):
Is there a way we can memoize this function? It looks like it can potentially be called several times in the course of a single invocation.
Hmm, or it looks like the function can be called multiple times per invocation, but with different "events" each time? If that's true, then we can probably leave it.
That's a great idea!

Correct me if I'm wrong, but does the layer only handle one event per invocation? Or if it's a busy Lambda, does it stay alive and potentially handle hundreds of events?

Just wondering to get an idea of how large to make the cache. I guess it can be pretty small anyway since each event is new and we don't repeat.
Each runtime instance will only ever handle one event at a time. It never handles two events concurrently.
Ah, just realized we can't memoize it because `event` is a dict, and mutable types are unhashable.

We could serialize the dict and use that, but I'm thinking that'd be much slower.
datadog_lambda/tracing.py
Outdated
    """
    try:
        detail = event.get("detail")
        dd_context = detail.get("_datadog")
        if not dd_context:
            return extract_context_from_lambda_context(lambda_context)

        if is_step_function_event(dd_context):
            return extract_context_from_step_functions(dd_context, lambda_context)
This one isn't wrapped in a try/except, but the two above are, why is that?
Ah, I also meant to wrap this in try/except, but let me explain why I'm doing it this way.

I wanted to pass in `lambda_context=None` so that if the extractor fails, it won't fall back on the `lambda_context`; it'll continue with the normal codepath and call

    return propagator.extract(dd_context)
Left some questions about performance and exception handling.
            logger.debug(
                "Failed to extract Step Functions context from EventBridge to SQS event."
            )

    return propagator.extract(dd_context)


def extract_context_from_eventbridge_event(event, lambda_context):
I am curious about how the concatenation of two queues (e.g., SFN → EventBridge → SQS → Lambda) is handled. Is it achieved by extracting two different contexts in the Python tracer? Does this mean that it also supports SFN → EventBridge → SQS → SNS → Lambda?
SFN → EventBridge → SQS → Lambda is handled the following way:
- The final event that enters the Lambda is an SQS event
- Our context extractor for SQS events checks if there's an EventBridge event within and uses that if it's valid

SFN → SNS → SQS → Lambda is handled very similarly, with another explicit check in the SQS extractor looking for nested SNS events.

We don't handle SFN → SQS → SNS → Lambda AFAIK, and we wouldn't be able to handle SFN → EventBridge → SQS → SNS → Lambda out of the box either. But this is only because it's not explicitly handled. The current Python layer implementation is messy because it relies on explicit handling. I think a perfect solution would be one where it's all handled recursively, so customers can nest an arbitrary number of supported services without explicit handling.

I think the AWS team would like to do something like this in bottlecap.
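The recursive idea could be sketched like this (entirely hypothetical, with toy event shapes; the point is that each service contributes one unwrap rule instead of being special-cased pairwise):

```python
def unwrap_recursively(event, rules):
    # Each rule is (detect, unwrap). Keep peeling nesting layers
    # (SQS body, EventBridge detail, SNS Message, ...) until no rule
    # matches; the innermost event is the one carrying trace context.
    for detect, unwrap in rules:
        if detect(event):
            return unwrap_recursively(unwrap(event), rules)
    return event

# Toy rules for an SFN -> EventBridge -> SQS chain (shapes simplified).
rules = [
    (lambda e: isinstance(e, dict) and "Records" in e,
     lambda e: e["Records"][0]["body"]),
    (lambda e: isinstance(e, dict) and "detail" in e,
     lambda e: e["detail"]),
]

sqs_event = {"Records": [{"body": {"detail": {"_datadog": {"State": {}}}}}]}
inner = unwrap_recursively(sqs_event, rules)
```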
@avedmala Thanks for the explanation. Very informative. I am guessing that a recursive solution should not be that complicated? @purple4reina @joeyzhao2018
Regarding the "recursive solution", is it written down in any RFC? It sounds interesting and might be able to solve some other problems.
@with_trace_propagation_style("datadog")
def test_step_function_trace_data_sns(self):
    """Test step function trace data extraction through SNS"""
    sns_event = {
        "Records": [
            {
                "EventSource": "aws:sns",
                "EventVersion": "1.0",
Wow, this file is getting really long! Not your job, but a future task for the serverless AWS team could be to pull out these events as JSON files and import them in the tests, instead of defining the objects directly here.
Looks good to me! Super thorough, and nice job on manually testing + sharing screenshots for trace propagation case
What does this PR do?
Allows us to extract Step Function trace context in the following cases
Motivation
Customer feature request for the following use case
It's not much more work to add the other cases so figured might as well
Testing Guidelines
SFN -> EventBridge -> Lambda trace

SFN -> EventBridge -> SQS -> Lambda trace
Additional Notes
The instrumentation involves injecting the `_datadog: {...}` trace context into these managed services from the Step Function definition; will add instructions in our public docs.

Types of Changes
Check all that apply