python3: retries fail due to incompatible code #133


Closed
rseuchter opened this issue Oct 31, 2019 · 7 comments

Comments

@rseuchter

Recently we started using aws-instance-scheduler 1.3.0. This morning I found that some instances had not been started despite the scheduler trying to start them.
In the log I found this:

2019-10-31 - 05:21:06.117 - ERROR : Error starting instances i-*****************,i-*****************,i-*****************,i-*****************,i-*****************, ('ClientError' object has no attribute 'message')

Fast-forward a few hours of research and digging through the code (at first I suspected a problem in botocore): while I still don't know what the exact underlying error is, I found what goes wrong when this error is handled.

In PEP 352 the message attribute of Python's Exception class was deprecated, and in Python 3 exceptions no longer have it. However, looking at boto_retry you'll find

return "throttling" in ex.message.lower()

This will fail under Python 3. As aws-instance-scheduler has recently been moved to Python 3, you start seeing this error.

My suggestion for fixing this is to handle the error the way botocore recommends, see https://botocore.amazonaws.com/v1/documentation/api/latest/client_upgrades.html#error-handling (via https://stackoverflow.com/a/33663484).
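For illustration, a minimal sketch of that pattern, assuming the instances are started through a boto3 EC2 client (the client setup and instance ID are placeholders, not the scheduler's actual code):

    import boto3
    from botocore.exceptions import ClientError

    ec2 = boto3.client("ec2")
    try:
        ec2.start_instances(InstanceIds=["i-0123456789abcdef0"])
    except ClientError as ex:
        # Read the structured error information instead of ex.message,
        # which no longer exists in Python 3.
        error_code = ex.response["Error"]["Code"]
        error_message = ex.response["Error"]["Message"]
        if error_code == "RequestLimitExceeded":
            pass  # safe to back off and retry
        else:
            raise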

Tomorrow we'll check whether downgrading the Lambda function to Python 2 can serve as a workaround.

@rseuchter

A downgrade to Python 2 does not resolve this; the scheduler won't work at all. However, we have a hotfix, and we can now see the underlying error, which is InsufficientInstanceCapacity for our particular instance type in that region.

I plan to share the details of the hotfix as soon as I get to it.

@rseuchter

The hotfix is basically just applying the pattern of https://github.com/awslabs/aws-instance-scheduler/blob/6f86b91fd641a2526c4c0b1c78245e41f3406efb/source/code/boto_retry/ec2_service_retry.py#L40 to the throttling check in https://github.com/awslabs/aws-instance-scheduler/blob/6f86b91fd641a2526c4c0b1c78245e41f3406efb/source/code/boto_retry/aws_service_retry.py#L52. Beyond that, I'd love to discuss the need for api_throttled() in the first place.
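As a rough illustration of the hotfix (not the exact code applied, and shown as a standalone function rather than the method it replaces), the throttling check can inspect the ClientError response instead of the removed message attribute:

    from botocore.exceptions import ClientError

    def api_throttled(ex):
        # Sketch only: look at the structured error code/message of a
        # botocore ClientError rather than ex.message, which is gone in Python 3.
        if not isinstance(ex, ClientError):
            return False
        error = ex.response.get("Error", {})
        text = "{} {}".format(error.get("Code", ""), error.get("Message", ""))
        return "throttling" in text.lower()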

Looking at throttling in botocore you may find different types of ThrottlingException in _retry.json and the like. However, searching the AWS documentation you hardly find any mention of this error code in connection with EC2 or RDS; in fact, these errors appear to be deprecated for those services. The RequestLimitExceeded throttling exceptions are already handled by ec2_service_retry.py.

Taking it even further would be to leave retries solely in the hands of botocore. That is, however, beyond the scope of this bug.
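For completeness, a minimal sketch of what leaving retries to botocore could look like; the retry count is an arbitrary example, and newer botocore releases also accept a "mode" key in this configuration:

    import boto3
    from botocore.config import Config

    # Let botocore retry throttling-style errors itself instead of a
    # custom boto_retry wrapper.
    config = Config(retries={"max_attempts": 10})
    ec2 = boto3.client("ec2", config=config)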

@mohsenari

Hi @rseuchter, thanks for submitting the issue. We confirmed that the exception message attribute does not work with Python 3. We will fix this issue and post an update here. Regarding the need for api_throttled(): as you mentioned, it is beyond the scope of this issue; if you'd like to discuss it, I recommend opening a separate issue.

@rseuchter

Thank you for taking this up.

Let me clarify the point about api_throttled(). Right now I think that removing it from the code is a perfectly valid approach to fixing this particular bug. The rationale is that I don't see strong indicators that the errors it is trying to handle are still thrown by the APIs of EC2 or RDS.

Putting retries (and thereby handling of RequestLimitExceeded) in the hands of botocore is what I think deserves a separate discussion.

@rseuchter

I just hit this bug again in another account using a pretty different setup. The bottom line is this:

This bug masks all sorts of errors and often blocks people from debugging problems with using or setting up the instance scheduler.

In the latest occurrence I disabled api_throttled() and was able to understand and debug the underlying problem.

@cgol

cgol commented Feb 24, 2020

+1 We're also impacted by this: our Instance Scheduler is stopping instances but not starting them, and we can't see the cause in the logs because of this error.

@chaitand28

This issue has been fixed in release 1.3.1. Please deploy the latest template to get the updated code.
