Missing credentials in config happening intermittently #692

Closed
davidporter-id-au opened this issue Aug 27, 2015 · 47 comments · Fixed by mapbox/tilelive-s3#92, bbc/sqs-consumer#90 or #1114
Labels
feature-request A feature should be added or improved.

Comments

@davidporter-id-au

We've been having some difficulty working out why the SDK is intermittently unable to fetch credentials, which leaves our application unauthorised. The EC2 instance where this occurs has a particular IAM role, so the SDK reaches out to the metadata endpoint (169.254...) to fetch its keys. However, when it does so it occasionally appears to throw this type of error:

So, for example, this failed DynamoDB call was logged by our application with an SDK error:

{
    "error": {
        "message": "Missing credentials in config",
        "code": "CredentialsError",
        "errno": "ECONNREFUSED",
        "syscall": "connect",
        "time": "2015-07-15T21:55:06.083Z",
        "originalError": {
            "message": "Could not load credentials from EC2MetadataCredentials",
            "code": "CredentialsError",
            "errno": "ECONNREFUSED",
            "syscall": "connect",
            "time": "2015-07-15T21:55:06.083Z",
            "originalError": {
                "code": "ECONNREFUSED",
                "errno": "ECONNREFUSED",
                "syscall": "connect",
                "message": "connect ECONNREFUSED"
            }
        }
    },
    "level": "error",
    "message": "DynamoDB Query failed",
    "timestamp": "2015-07-15T21:55:06.087Z"
}

More recently, this S3 call had this similar error:

...
    "originalError": {
      "message": "Could not load credentials from any providers",
      "code": "CredentialsError",
      "errno": "ECONNREFUSED",
      "syscall": "connect",
      "address": "169.254.169.254",
      "port": 80,
      "time": "2015-08-26T06:08:18.008Z",
      "originalError": {
        "code": "ECONNREFUSED",
        "errno": "ECONNREFUSED",
        "syscall": "connect",
        "address": "169.254.169.254",
        "port": 80,
        "message": "connect ECONNREFUSED 169.254.169.254:80"
      }
    }
...

We've experienced the problem intermittently across multiple applications, but as frequently as half a dozen times per day on a single EC2 instance. We're using the Node.js aws-sdk version 2.1.46 in the example above, with io.js 2.3.1 here and Node.js 0.12.x elsewhere. We're in the ap-southeast-2 region.

While it would appear that the connection is being refused, I'd be surprised to see this endpoint actually go down. Is it possible we're doing something stupid with Node to cause this, or could there be a genuine issue?

@AdityaManohar
Contributor

@davidporter-id-au It looks like the EC2 metadata service is throttling requests from your code. The SDK itself does cache credentials fetched from the metadata service, so multiple simultaneous requests don't bombard the metadata service. See #448

Is your code part of a shell script that is invoked in a loop of some sort? Hitting the metadata service multiple times in succession can cause the requests to be throttled.

@weevilgenius

We've been seeing the same issue when our EC2 instances are under heavy load and the application must make many requests to S3 within a short time.

We're also using IAM roles applied to EC2 instances, and there are no other applications, cron jobs, or scripts other than a single node.js instance which is using the latest AWS SDK (2.1.49).

Sample error message:

TimeoutError: Missing credentials in config
    at ClientRequest.<anonymous> (/opt/cloudio-server/node_modules/aws-sdk/lib/http/node.js:56:34)
    at ClientRequest.g (events.js:260:16)
    at emitNone (events.js:67:13)
    at ClientRequest.emit (events.js:166:7)
    at Socket.emitTimeout (_http_client.js:534:10)
    at Socket.g (events.js:260:16)
    at emitNone (events.js:67:13)
    at Socket.emit (events.js:166:7)
    at Socket._onTimeout (net.js:318:8)
    at Timer.unrefTimeout (timers.js:510:13)

@davidporter-id-au
Author

@AdityaManohar Throttling of the endpoint was my first thought too. Regarding the script being started repeatedly: no, it's a (koajs) webserver, so it starts once and runs indefinitely.

I put a console.log at the point where the SDK appeared to process the metadata request, to see if it was being called multiple times, and observed that the metadata was only fetched on startup, not thereafter.

I also verified this by intentionally creating a worst-case scenario: require()-ing the SDK within a loop, which showed the credentials being fetched each time. That is not what we're seeing in our production app. So I'm certainly not going to rule out us doing something stupid, but I don't think we're hammering the metadata endpoint.

We have since also discovered that a delayed retry appears to resolve the issue. However, this is a kludgy workaround rather than something I'd like to rely on.
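
For anyone else resorting to the same stopgap, a minimal sketch of what we mean is below (the helper, the delay values and the query parameters are illustrative, not our production code):

var AWS = require('aws-sdk');
var dynamodb = new AWS.DynamoDB({ region: 'ap-southeast-2' });

// Retry an SDK call a few times with a fixed delay when it fails with a
// CredentialsError (hypothetical helper, not part of the SDK).
function withCredentialRetry(operation, attempts, delayMs, callback) {
    operation(function(err, data) {
        if (err && err.code === 'CredentialsError' && attempts > 1) {
            return setTimeout(function() {
                withCredentialRetry(operation, attempts - 1, delayMs, callback);
            }, delayMs);
        }
        callback(err, data);
    });
}

// Usage: wrap the DynamoDB query that fails intermittently.
var queryParams = { TableName: 'your table name here' /* plus your key conditions */ };
withCredentialRetry(function(cb) {
    dynamodb.query(queryParams, cb);
}, 3, 2000, function(err, data) {
    if (err) { return console.error(err); }
    console.log(data.Items);
});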

@weevilgenius

I tracked down a detailed error message for my case:

{
  "message": "Missing credentials in config",
  "code": "CredentialsError",
  "time": "Thu Sep 03 2015 17:17:33 GMT+0000 (UTC)",
  "originalError": {
    "message": "Could not load credentials from any providers",
    "code": "CredentialsError",
    "time": "Thu Sep 03 2015 17:17:33 GMT+0000 (UTC)",
    "originalError": {
      "message": "Connection timed out after 1000ms",
      "code": "TimeoutError",
      "time": "Thu Sep 03 2015 17:17:33 GMT+0000 (UTC)"
    }
  }
}

I see we're getting a connection timeout when trying to load credentials instead of a connection refused. That might be a different issue, even though the top level error is the same.

@stemail23

I'm seeing exactly this issue too. It is easily reproduced by having a bunch of processes each run a script that creates a heap of aws-sdk instances and then consumes an API endpoint on them. (I realise this is not a realistic situation, but it allows an intermittent issue to be reproduced reliably.)

The example code that I use to reproduce this on a t2.micro instance is:

var async = require('async');

var handlers = [];

// Each handler requires the SDK afresh, creates a new SQS client, creates a
// queue, and then immediately polls it for messages.
var addHandler = function(value) {
    handlers.push(function(callback) {
        var queueName = 'your queue name here';
        var region = 'ap-southeast-2';
        var createQueueParams = { QueueName: queueName };
        var aws = require('aws-sdk');
        aws.config.region = region;
        var sqs = new aws.SQS();
        sqs.createQueue(createQueueParams, function(err, data) {
            if (err) { return callback(err); }
            var params = {
                QueueUrl: data.QueueUrl,
                MaxNumberOfMessages: 10,
                WaitTimeSeconds: 2
            };
            sqs.receiveMessage(params, callback);
        });
    });
};

// Queue up 500 handlers and run at most 100 of them concurrently.
for (var x = 0; x < 500; x += 1) {
    addHandler(x);
}

async.parallelLimit(handlers, 100, function(err, results) {
    if (err) { return console.error(err); }
    console.log(results.length);
});

If I invoke this code from 10 different node processes simultaneously, then I can pretty much guarantee that the error will be raised (returned in the err on sqs.createQueue)

There is a bigger problem associated with this situation however. I have found that after encountering the issue:

a) The EC2 instance becomes unreliable and typically is pretty much a write-off. Usually I cannot SSH into the machine, and the only recourse has been to terminate (even restart often fails).

b) The biggest issue of all: even though the EC2 instance is effectively dead and unreachable, the EC2 console still reports it as healthy, AND therefore any autoscaler that instantiated the instance is unaware of the failure and does not replace it. In my use case, I'm using an autoscaler group with desired = 1 to ensure failover on my instances. Due to this issue I CANNOT rely on instance monitoring on autoscalers.

It occurs to me that the resolution to this problem ought to be relatively trivial in the aws-sdk (surely just a retry with incremental backoff when retrieving the credentials), but I'm concerned that the EC2 instance issues I'm seeing associated with this problem are symptomatic of a deeper underlying bug in the credentials endpoint code on the instance itself.

@AdityaManohar
Contributor

@stemail23

If I invoke this code from 10 different node processes simultaneously, then I can pretty much guarantee that the error will be raised

If you are spawning multiple Node.js processes you are more likely to be throttled by the EC2 metadata service. The SDK itself will cache credentials after the first fetch. require()-ing the SDK multiple times is going to cause credentials to be fetched multiple times - once for each instance of the SDK.

It looks like some of the other issues that you are having are related to the EC2 instance itself and not the SDK. I would recommend opening up an issue on the Amazon EC2 Forum.

In the meantime, we can definitely look at adding retries and exponential back-off to the EC2 metadata service requests.

@stemail23

Yep, I understand why I see the issue, I built the scenario explicitly to expose it!

The simple facts: it is possible, in fact inevitable, using only AWS products (EC2 and the SDK), to bring an EC2 instance to its knees. The exact steps to reproduce the situation are outlined above. What's frustrating to me, as a customer, is the difficulty I'm having raising this as a bug report. I assumed there would be an internal process to route it to the appropriate place, but instead I keep getting redirected myself.

@davidporter-id-au
Author

@stemail23 Just as an aside, the health check behaviour you're seeing is - I think - expected. You need to switch your auto scaling group to use ELB health checks rather than the default EC2 health check, which I believe just uses the hypervisor's system checks and is not aware of any network effects. Once you've done so, the scaling group will rely on the HTTP or TCP health check and kill and scale accordingly. I'm no expert on the subject, but I recall my team having to make that explicit fix.
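
For reference, something along these lines should do it with the JS SDK (the group name is a placeholder, and the UpdateAutoScalingGroup parameters are my best recollection of the API, so double-check them):

var AWS = require('aws-sdk');
var autoscaling = new AWS.AutoScaling({ region: 'ap-southeast-2' });

// Switch the group from the default EC2 status checks to the ELB health check.
autoscaling.updateAutoScalingGroup({
    AutoScalingGroupName: 'your scaling group name here',
    HealthCheckType: 'ELB',
    HealthCheckGracePeriod: 300 // seconds to wait before the first ELB check counts
}, function(err) {
    if (err) { return console.error(err); }
    console.log('Health check type switched to ELB');
});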

@AdityaManohar We have had some success in addressing the issue with crude retries. If the SDK were able to do this without intervention, while also handling backoff, that would be good.

@seriousben

We've also been hit by this today intermittently on code that was working fine before...

@stemail23

@davidporter-id-au

Thanks for the suggestion. Unfortunately, in my case, I don't have an ELB in the equation on these instances (they're job handler machines lifting messages from SQS). I'm exploring other options where a monitor machine attempts to recognise dead instances and terminate them, but it's frustrating to have to expend this effort!

@stemail23

We've also been hit by this today intermittently on code that was working fine before...

Exactly, which is why I suspect that some change in EC2 is complicit in the situation, rather than being solely an AWS-SDK issue.

@davidporter-id-au
Author

I suppose there are two issues: the single point of failure this reveals in the SDK for this kind of authentication, and the probable infrastructure problem we're seeing where the metadata endpoint is subject to transient failure. For the latter I had created a support ticket, but let it expire. I'll follow that up.

@stemail23 @seriousben I notice you're responding when I am. Are you ap-southeast-2 for reference?

@stemail23

@davidporter-id-au Yes, I'm in Sydney

@seriousben

@davidporter-id-au - us-east for us.

@AdityaManohar
Contributor

@davidporter-id-au @stemail23 @seriousben
You can try increasing the timeout of the AWS.EC2MetadataCredentials provider by setting the httpOptions.timeout option. This defaults to 1000 ms.

var AWS = require('aws-sdk');
AWS.config.credentials = new AWS.EC2MetadataCredentials({
  httpOptions: { timeout: 4000 }
});

This should help alleviate some of the issues with a slow responding metadata service.

@areichman

We started seeing the issue this week as well. For us, it happened when we updated our Node install from version 0.10.17 to 4.2.2. Our process runs from a crontab every 15 minutes and sends about 20K messages to SQS. With 0.10.17, we ran with no issues; within 30 minutes of updating to 4.2.2 we started seeing the intermittent issues. In both cases, we had the same 2.2.18 version of the SDK.

A similar issue was discussed here in the past: #445

@willwhite and @mick, have you seen any similar issues since your update was added to the SDK?

@zbjornson

This started happening for us recently. Sporadically when uploading to S3 from the nodejs SDK (v2.2.11 and 2.2.33) we would get the same error posted in #692 (comment). Increasing the timeout to 4000 ms didn't fix it; increasing it to 10000 ms did.

We're also not hammering the endpoint (in fact our test server was making a single request at a time) -- it seems like the metadata endpoint is simply slow to respond, given that increasing the timeout alleviates it.

@rfink

rfink commented Apr 11, 2016

Also having this issue +1

@seriousben

We fixed this by using only one instance of the SDK.
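
Roughly, the pattern is to configure the SDK once in a shared module and require the same client objects everywhere (the file name, region and clients below are just illustrative):

// aws-clients.js -- configure the SDK once and export singleton clients so
// every caller reuses the same (cached) credentials object.
var AWS = require('aws-sdk');
AWS.config.update({ region: 'us-east-1' });

module.exports = {
    s3: new AWS.S3(),
    sqs: new AWS.SQS()
};

// elsewhere in the app:
// var s3 = require('./aws-clients').s3;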

@bbarney

bbarney commented May 6, 2016

We are seeing this too. It can't be a throttling issue: it's on a staging instance that is only hit a few times per hour. Additionally, it happens at application startup, so the server never starts.

@Glavin001

Glavin001 commented May 11, 2016

Also experiencing this issue. I tried increasing the timeout to 10 seconds to no avail.

@bbarney have you found any workarounds? I am experiencing the same issue on startup, every single time.

@rogerwaldvogel

Also having this issue

@samuelsensei

Having the same problem here. Problem comes and goes. Especially happens when I register a new user and log them in.

@dparmar74

Having the same issue. Increasing timeout didn't help

@ApsOps

ApsOps commented Jun 23, 2016

Same issue while using SQS for us. We're using a single instance of SDK object.

@rfink

rfink commented Jun 28, 2016

Same here, single instance of SDK, still problems.

@juanstiza

Same here, it is happening with S3... Strangely enough, it works on Ubuntu but not on Mac; I'll have to check my network settings.

@codan84

codan84 commented Aug 17, 2016

We have an application running as a cron job executed every 10s on an EC2 instance, and we see this issue very frequently. Since the application only runs for about 3-4s every 10s, we require() the aws-sdk each time the app starts. Is there any way around this issue for a scenario like this?

@seriousben

seriousben commented Aug 18, 2016

You could change your code to start the loop only after AWS.config.getCredentials(cb) finishes. Otherwise you fire async S3 operations at the same time and they all think (and they are right) that they need to fetch credentials.
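
Something along these lines (startPollingLoop is just a stand-in for whatever the job actually does):

var AWS = require('aws-sdk');

// Resolve credentials once, up front; only kick off the work after the
// callback fires so concurrent operations reuse the already-cached credentials.
AWS.config.getCredentials(function(err) {
    if (err) { return console.error('Could not load credentials', err); }
    startPollingLoop(); // stand-in for the job's real entry point
});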

@codan84

codan84 commented Aug 18, 2016

Indeed, that's essentially what I did. I have updated my comment above with the solution.

@seriousben

Making sure AWS.config.getCredentials is called before any AWS operations, AND making sure to reuse the same AWS.S3 instance everywhere, will get rid of this problem for good.

@rfink

rfink commented Aug 18, 2016

Oddly enough, this is still happening to me even when specifying the credentials in environment variables.

@dnorth98

We're seeing this too now on apps running in Node under ECS (using role credentials of course). Are there any signs this will be fixed in the future?

@LiuJoyceC
Contributor

Hi,

We are actively still looking at this issue and appreciate your patience.

@dnorth98 Is the error you're getting when running on ECS the same "missing credentials in config" error and is it also intermittent? Can you confirm that the SDK is hitting the ECS credential endpoint rather than hitting the EC2 Metadata service? Thanks

@dnorth98

@LiuJoyceC We are getting the same credentials error (it's actually when making a dynamoDB call)

Missing credentials in config. It's highly intermittent (maybe 1 call in 1000) but noticeable enough to cause errors in the client service.

Regarding how we get the credentials, we're not hitting the metadata service directly. We're just initializing the DynamoDB client without passing in specific credentials (i.e. using the role creds).

@LiuJoyceC
Contributor

Hi @codan84
How are you verifying that the SDK is hitting the EC2 Metadata Service endpoint each time you call s3.upload()? Are you checking the number of instances of EC2MetadataCredentials that are created, or are you actually checking requests made to the endpoint? When I tried to reproduce the error with the code you provided above, 200 instances of EC2MetadataCredentials were indeed instantiated, but only one of them actually made a request to the Metadata Service. You can verify this by mocking (or logging something inside) the request() function on the MetadataService class (lib/metadata_service.js) and running your example code again.

The reason for this is the implementation of the loadCredentials function in the MetadataService class. A queue of callbacks is maintained, and as long as the queue holds more than one entry, no new request to the Metadata Service is made. That means only one request can be in flight at a time, and when the response comes back, all of the callbacks in the queue are called, so they don't each need to make a request.
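
In other words, something like this simplified sketch (not the SDK's actual source, just the shape of the behaviour; requestMetadataService stands in for the real HTTP call):

// Simplified sketch of the behaviour described above (not the SDK's source).
var pendingCallbacks = [];

function loadCredentials(callback) {
    pendingCallbacks.push(callback);
    // If someone is already waiting, a request is in flight -- just queue up.
    if (pendingCallbacks.length > 1) { return; }
    requestMetadataService(function(err, credentials) { // stand-in for the HTTP call
        var waiting = pendingCallbacks;
        pendingCallbacks = [];
        waiting.forEach(function(cb) { cb(err, credentials); });
    });
}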

@stemail23
I wasn't able to reproduce the error with the code snippet you provided (on Sept 6) either, for the same reason as above. Although the code is making hundreds of requests to SQS, they don't each make a request to the EC2 Metadata Service, and no matter how much I fiddled with the number of iterations of the for loop or the parallel limit, I couldn't get the Metadata Service to return an error (I was also running on a t2.micro instance). (And it doesn't make a difference how many times you call require('aws-sdk'): require() caches module exports, and subsequent calls simply retrieve from the cache, so there aren't multiple aws-sdk instances being created.)

That said, even though only one request to the Metadata Service can be in flight at a time, it is possible to hit the Metadata Service too many times in a short time span (you can keep hitting it as soon as the previous response comes back). Given that this error is intermittent ( @dnorth98 mentioned that it happens about 1 out of 1000 times), implementing retries with exponential backoff would likely resolve the problem, as it is unlikely that the 2nd or 3rd try will get the error again. I am actively working on that now and will provide an update when it is finished. Since it was reported above that this problem has also occurred on ECS, I can also implement the exponential backoff in the ECSCredentials provider.
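
To illustrate the intent, the retry logic would look something like this (illustrative only, not the actual PR):

// Illustrative only: retry a credential fetch with exponential backoff
// (delay = base * 2^attempt milliseconds between attempts).
function fetchWithBackoff(fetchCredentials, maxRetries, baseDelayMs, callback, attempt) {
    attempt = attempt || 0;
    fetchCredentials(function(err, credentials) {
        if (!err || attempt >= maxRetries) { return callback(err, credentials); }
        setTimeout(function() {
            fetchWithBackoff(fetchCredentials, maxRetries, baseDelayMs, callback, attempt + 1);
        }, baseDelayMs * Math.pow(2, attempt));
    });
}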

@stemail23

@LiuJoyceC Thanks for the feedback. I was able to reliably reproduce the issue with the code I provided, but I admit I haven't looked into it since, so it's possible that things have changed. I notice, though, that you don't mention running multiple processes, so perhaps that indicates why you couldn't reproduce? To reproduce the issue I needed to run the provided script in up to ten processes concurrently.

Thanks for looking into the issue. Hopefully you'll have some success with backed off retries, and hopefully the suggestions above might help you test a fix if you can reproduce the problem.

Cheers!

@LiuJoyceC
Contributor

Hi,

The PR for retrying EC2MetadataCredentials and ECSCredentials has been merged to master, so you can try it out now by cloning the repo, or you can wait for the next release of the SDK in NPM. By default it times out after 1000ms and retries up to 3 times with a base delay of 100ms. If you still get intermittent timeout errors even with this default retry behavior, you can try increasing the timeout, the max retries, and the retry delay:

AWS.config.credentials = new AWS.EC2MetadataCredentials({
    httpOptions: { timeout: 5000 },
    maxRetries: 10,
    retryDelayOptions: { base: 200 }
});

If that still doesn't work, please let me know!

@stemail23

Thanks @LiuJoyceC

@rfink

rfink commented Oct 24, 2016

So this is still happening in v2.6.9 on an EC2 instance (utilizing Elastic Beanstalk).

{"message":"Missing credentials in config","name":"CredentialsError","stack":"Error: connect ECONNREFUSED 169.254.169.254:80\n at Object.exports._errnoException (util.js:874:11)\n at exports._exceptionWithHostPort (util.js:897:20)\n at TCPConnectWrap.afterConnect as oncomplete","code":"CredentialsError"}

@ckknight

Ran into this issue locally - was due to some shenanigans with process.env.

The fix was to manually pass accessKeyId and secretAccessKey to aws.config.update(...).
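
Roughly (reading the keys from environment variables here purely as an example):

var AWS = require('aws-sdk');

// Supply the keys explicitly instead of relying on the default provider chain.
AWS.config.update({
    accessKeyId: process.env.AWS_ACCESS_KEY_ID,
    secretAccessKey: process.env.AWS_SECRET_ACCESS_KEY
});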

@JoeMcGuire

I just hit this issue on AWS ECS (Elastic Container Service), which requires ECSCredentials instead of EC2MetadataCredentials.

AWS.config.credentials = new AWS.ECSCredentials({
  httpOptions: { timeout: 5000 },
  maxRetries: 10,
  retryDelayOptions: { base: 200 }
})

@pe8ter

pe8ter commented Jun 5, 2018

@LiuJoyceC Should this credentials timeout configuration be created once per require of the AWS SDK, or once globally for an entire application?

@rfink

rfink commented Nov 22, 2018

Still happening for me in ECS with aws-sdk version 2.270.1 and node.js version 10.11

@lock

lock bot commented Sep 28, 2019

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs and link to relevant comments in this thread.

@lock lock bot locked as resolved and limited conversation to collaborators Sep 28, 2019