Improve recorded detail of stacktraces #70


Merged 2 commits into aws:master on Sep 5, 2018

Conversation

@larsbutler (Contributor) commented Jun 22, 2018

In many X-Ray SDK use cases, including the Django/Flask middleware and
aws_xray_sdk.core.xray_recorder.capture and .capture_async cases, there
is a broad try/except "catch-all" for exceptions which records useful
information in an X-Ray trace in the event of an unexpected exception.

However, the approach (prior to this patch) of using
traceback.extract_stack in the except block to extract stack trace
information isn't terribly useful when viewing X-Ray traces for
debugging purposes. traceback.extract_stack only shows the stack state
up to and including the traceback.extract_stack call in the except
block; it does not include the stack trace information from the caught
exception--which is arguably the most important detail to include in
the X-Ray trace! Similarly, including only the exception's stack trace
information can omit important code path context if the
exception-producing code is wrapped in a decorator. This is a very
subtle but important detail. There is a good discussion of exactly this
problem--specifically in the context of decorators with a "catch-all"
try/except--on Stack Overflow:
https://stackoverflow.com/questions/14527819/traceback-shows-up-until-decorator.
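
A tiny standalone demo of the gap (illustration only, not code from this patch):

```
import sys
import traceback


def bar():
    raise RuntimeError('oops')


def capture():
    try:
        bar()
    except RuntimeError:
        # Current stack only: ['<module>', 'capture'] -- bar(), where
        # the exception was actually raised, is missing.
        print([f[2] for f in traceback.extract_stack()])
        # Exception frames only: ['capture', 'bar'] -- the outer caller
        # context (e.g. a decorator wrapper) is missing.
        print([f[2] for f in traceback.extract_tb(sys.exc_info()[2])])


capture()
```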

Description of changes:

This patch introduces the stacktrace utility module and is an attempt
to include more relevant info in recorded X-Ray traces while respecting
the max_trace_back behavior expressed in many places to avoid bloated
trace messages.
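
In sketch form, the idea is roughly this (simplified; the exact truncation semantics for max_trace_back may differ in the actual patch):

```
import sys
import traceback


def get_stacktrace(limit=None):
    # Frames from the current call stack, starting at our caller so
    # this helper itself doesn't show up in the recorded trace.
    stack = traceback.extract_stack(sys._getframe().f_back)
    # If an exception is currently being handled, append its frames;
    # they cover the path from the try/except down to the raise site.
    exc_tb = sys.exc_info()[2]
    if exc_tb is not None:
        stack += traceback.extract_tb(exc_tb)
    # Honor the configured limit, keeping the frames nearest the error
    # (the actual patch may truncate differently).
    if limit is not None and limit > 0:
        stack = stack[-limit:]
    return stack
```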

To illustrate the difference this makes, consider a trivial Lambda function written in Python, in a single module called myfunction.py:

```
from aws_xray_sdk.core import xray_recorder


@xray_recorder.capture('handler')
def handler(event, context):
    data = foo()
    return data


def foo():
    return bar()


def bar():
    raise Exception('Something went wrong!')
    return {'data': [1, 2, 3]}
```

There are multiple function calls here in order to create an interesting call stack for tracing purposes. To test this function, I've simply built a function zip with the appropriate version of the X-Ray SDK and invoked the function with lambda:Invoke.
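
For reference, the invoke call from Python looks roughly like this ('myfunction' is a stand-in for whatever name the zip was deployed under):

```
import boto3

client = boto3.client('lambda')
# The payload doesn't matter here since the handler raises unconditionally.
response = client.invoke(FunctionName='myfunction', Payload=b'{}')
print(response['StatusCode'])
```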

With the current implementation of aws_xray_sdk, here's the trace that gets recorded by X-Ray:

```
Exception: Something went wrong!
at <module> (bootstrap.py:538)
at main (bootstrap.py:533)
at handle_event_request (bootstrap.py:250)
```

With this patch applied, we get a lot more information about the call stack, as we would expect:

```
Exception: Something went wrong!
at <module> (bootstrap.py:538)
at main (bootstrap.py:533)
at handle_event_request (bootstrap.py:250)
at handler (myfunction.py:6)
at foo (myfunction.py:11)
at bar (myfunction.py:15)
```

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.


I'm not sure how best to test this (since stack information is very runtime-specific), so ideas are welcome; one possibility is sketched below. My patch also assumes that this fix implements the originally intended behavior, so it shouldn't be too controversial, but I'm naturally happy to discuss and attempt alternative solutions. Please let me know if the problem I've described above isn't clear.
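
One option is to assert on function names rather than file paths and line numbers, since names are stable across runtimes. A rough sketch, assuming the new module lands at aws_xray_sdk.core.utils.stacktrace and exposes a get_stacktrace helper (both names are illustrative):

```
from aws_xray_sdk.core.utils import stacktrace


def test_stacktrace_includes_exception_frames():
    def bar():
        raise ValueError('boom')

    def foo():
        return bar()

    try:
        foo()
    except ValueError:
        stack = stacktrace.get_stacktrace(limit=10)

    # Frame entries behave like (filename, lineno, name, line) tuples,
    # so index 2 is the function name.
    names = [frame[2] for frame in stack]
    assert 'foo' in names
    assert 'bar' in names
```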

@larsbutler (Contributor Author)

Hey @haotianw465, the CI build failures (related to aiohttp) don't appear to have anything to do with this patch. The issue seems to exist in the previous commit as well: https://travis-ci.org/aws/aws-xray-sdk-python/builds/389484429?utm_source=github_status&utm_medium=notification.

It seems the problem is related to aiohttp versions, but I haven't delved into it yet. Do you have any ideas?

@haotianw465 (Contributor)

Thank you for submitting the PR. There is a known issue on the CI dependencies and I'm working on a fix. I will get back to the PR once the CI is back running. Thank you for your patience.

@larsbutler (Contributor Author)

@haotianw465 Okay, cool. I'll check back later.

@haotianw465 (Contributor)

Sorry for the late follow-up. Thank you for diving deep and for the contribution.

Do you mind putting two example subsegment outputs with/without the fix recording the same exception? This is both for clarification and future reference should anyone want to know more about this change.

@larsbutler (Contributor Author)

@haotianw465 Hey, thanks for fixing the build. I'll rebase on that.

> Do you mind putting two example subsegment outputs with/without the fix recording the same exception? This is both for clarification and future reference should anyone want to know more about this change.

Sure, I'll come up with a small sample Lambda function and post the before/after traceback details.

Do you want me to record that information in a commit message as well (so that folks digging through the code can find it easier), or is it sufficient to just post it here?

In many X-Ray SDK use cases, including the Django/Flask middleware and
aws_xray_sdk.core.xray_recorder.capture and .capture_async cases, there
is a broad try/except "catch-all" for exceptions which records useful
information in an X-Ray trace in the event of an unexpected exception.

However, the approach (prior to this patch) of using
`traceback.extract_stack` in the `except` block to extract stack trace
information isn't terribly useful when viewing X-Ray traces for
debugging purposes. `traceback.extract_stack` only shows the stack state
up to and including the `traceback.extract_stack` call in the `except`
block; it does not include the stack trace information from the caught
exception--which is arguably the most important detail to include in
the X-Ray trace! Similarly, including _only_ the exception's stack trace
information can omit important code path context if the
exception-producing code is wrapped in a decorator. This is a very
subtle but important detail. There is a good discussion of exactly this
problem--specifically in the context of decorators with a "catch-all"
try/except--on Stack Overflow:
https://stackoverflow.com/questions/14527819/traceback-shows-up-until-decorator.

This patch introduces the `stacktrace` utility module and is an attempt
to include more relevant info in recorded X-Ray traces while respecting
the `max_trace_back` behavior expressed in many places to avoid bloated
trace messages.

To illustrate the difference this makes, consider a trivial Lambda
function written in Python with a single module called `myfunction.py`:

```
from aws_xray_sdk.core import xray_recorder


@xray_recorder.capture('handler')
def handler(event, context):
    data = foo()
    return data


def foo():
    return bar()


def bar():
    raise Exception('Something went wrong!')
    return {'data': [1, 2, 3]}
```

There are multiple function calls here in order to create an
interesting call stack for tracing purposes.

With the current implementation of aws_xray_sdk, here's the trace that
gets recorded by X-Ray:

    Exception: Something went wrong!
    at <module> (bootstrap.py:538)
    at main (bootstrap.py:533)
    at handle_event_request (bootstrap.py:250)

With this patch applied, we get a lot more information about the call
stack, as we would expect:

    Exception: Something went wrong!
    at <module> (bootstrap.py:538)
    at main (bootstrap.py:533)
    at handle_event_request (bootstrap.py:250)
    at handler (myfunction.py:6)
    at foo (myfunction.py:11)
    at bar (myfunction.py:15)

With this we can much more easily trace the origin of a runtime error.

@larsbutler (Contributor Author)

@haotianw465 I updated the PR text and added the examples as you requested. I also went ahead and included a summary of the before/after trace differences in the commit message, so that folks digging through the code can find it more easily.

Does that help?

@haotianw465 (Contributor)

Yes, that helps, thank you for the contribution! I'm merging it in and will prepare a new release soon.

@haotianw465 merged commit 4d79dd5 into aws:master on Sep 5, 2018