python3: retries fail due to incompatible code #133


Closed
rseuchter opened this issue Oct 31, 2019 · 7 comments

Comments

@rseuchter

Recently we started using aws-instance-scheduler 1.3.0. This morning I found that some instances had not been started despite the scheduler trying to start them.
In the log I found this:

2019-10-31 - 05:21:06.117 - ERROR : Error starting instances i-*****************,i-*****************,i-*****************,i-*****************,i-*****************, ('ClientError' object has no attribute 'message')

Fast-forward a few hours of research and digging through the code (at first I suspected a problem in botocore): while I still don't know what the exact underlying error is, I found what goes wrong when this error is handled.

In PEP 352 the message attribute of Python's Exception class was deprecated, and in Python 3 exceptions no longer have it. However, looking at boto_retry you'll find

return "throttling" in ex.message.lower()

This will fail under Python 3. As aws-instance-scheduler has recently been moved to Python 3, you start seeing this error.

My suggestion for fixing this is to handle the error the way botocore recommends, see https://botocore.amazonaws.com/v1/documentation/api/latest/client_upgrades.html#error-handling (via https://stackoverflow.com/a/33663484).
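For illustration, a minimal sketch of that pattern, assuming the instances are started through a boto3 EC2 client (the client setup and instance ID are placeholders, not the scheduler's actual code):

    import boto3
    from botocore.exceptions import ClientError

    ec2 = boto3.client("ec2")
    try:
        ec2.start_instances(InstanceIds=["i-0123456789abcdef0"])
    except ClientError as ex:
        # Read the structured error information instead of ex.message,
        # which no longer exists in Python 3.
        error_code = ex.response["Error"]["Code"]
        error_message = ex.response["Error"]["Message"]
        if error_code == "RequestLimitExceeded":
            pass  # safe to back off and retry
        else:
            raise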

Tomorrow we'll check whether downgrading the Lambda function to Python 2 can serve as a workaround.

@rseuchter

A downgrade to Python 2 does not resolve this; the scheduler won't work at all. However, we have a hotfix, and we can now see the underlying error, which is InsufficientInstanceCapacity for our particular instance type in that region.

I plan to share the details of the hotfix as soon as I get to it.

@rseuchter

The hotfix is basically just applying the pattern of https://github.com/awslabs/aws-instance-scheduler/blob/6f86b91fd641a2526c4c0b1c78245e41f3406efb/source/code/boto_retry/ec2_service_retry.py#L40 to the throttling check in https://github.com/awslabs/aws-instance-scheduler/blob/6f86b91fd641a2526c4c0b1c78245e41f3406efb/source/code/boto_retry/aws_service_retry.py#L52. Beyond that, I'd love to discuss the need for api_throttled() in the first place.
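As a rough illustration of the hotfix (not the exact code applied, and shown as a standalone function rather than the method it replaces), the throttling check can inspect the ClientError response instead of the removed message attribute:

    from botocore.exceptions import ClientError

    def api_throttled(ex):
        # Sketch only: look at the structured error code/message of a
        # botocore ClientError rather than ex.message, which is gone in Python 3.
        if not isinstance(ex, ClientError):
            return False
        error = ex.response.get("Error", {})
        text = "{} {}".format(error.get("Code", ""), error.get("Message", ""))
        return "throttling" in text.lower()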

Looking at throttling in botocore you may find different types of ThrottlingException in _retry.json and the like. However, searching the AWS documentation you hardly find any mention of this error code in connection with EC2 or RDS; in fact, these errors appear to be deprecated for those services. The RequestLimitExceeded throttling exceptions are already handled by ec2_service_retry.py.

Taking it even further would be to leave retries solely in the hands of botocore. That is, however, beyond the scope of this bug.
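For completeness, a minimal sketch of what leaving retries to botocore could look like; the retry count is an arbitrary example, and newer botocore releases also accept a "mode" key in this configuration:

    import boto3
    from botocore.config import Config

    # Let botocore retry throttling-style errors itself instead of a
    # custom boto_retry wrapper.
    config = Config(retries={"max_attempts": 10})
    ec2 = boto3.client("ec2", config=config)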

@mohsenari

Hi @rseuchter, thanks for submitting the issue. We confirmed that the exception message attribute does not work with Python 3. We will fix this issue and post an update here. Regarding the need for api_throttled(): as you mentioned, it is beyond the scope of this issue; if you'd like to discuss it, I recommend opening a separate issue.

@rseuchter

Thank you for taking this up.

Let me clarify the point about api_throttled(). Right now I think that removing it from the code is a perfectly valid approach to fixing this particular bug. The rationale is that I don't see strong indicators that the errors it is trying to handle are still thrown by the APIs of EC2 or RDS.

Putting retries (and thereby handling of RequestLimitExceeded) in the hands of botocore is what I think deserves a separate discussion.

@rseuchter

I just hit this bug again in another account using a pretty different setup. The bottom line is this:

This bug masks all sorts of errors and often blocks people from debugging problems with using or setting up the instance scheduler.

In the latest occurrence I disabled api_throttled() and was able to understand and debug the underlying problem.

@cgol

cgol commented Feb 24, 2020

+1 We're also impacted by this: our Instance Scheduler is stopping instances but not starting them, and we can't see the cause in the logs because of this error.

@chaitand28

This issue has been fixed in release 1.3.1. Please deploy the latest template to get the updated code.
