Missing credentials in config happening intermittently #692

Closed
davidporter-id-au opened this issue Aug 27, 2015 · 47 comments · Fixed by mapbox/tilelive-s3#92, bbc/sqs-consumer#90 or #1114
Labels
feature-request A feature should be added or improved.

Comments

@davidporter-id-au

We've been having some difficulty working out why the SDK is intermittently unable to fetch credentials, which leaves our application unauthorised. The EC2 instance where this occurs has a particular IAM role, so the SDK reaches out to the metadata endpoint (169.254...) to fetch its keys. However, when it does so it occasionally appears to throw this type of error:

So, for example, this failed DynamoDB call was logged by our application with an SDK error:

{
    "error": {
        "message": "Missing credentials in config",
        "code": "CredentialsError",
        "errno": "ECONNREFUSED",
        "syscall": "connect",
        "time": "2015-07-15T21:55:06.083Z",
        "originalError": {
            "message": "Could not load credentials from EC2MetadataCredentials",
            "code": "CredentialsError",
            "errno": "ECONNREFUSED",
            "syscall": "connect",
            "time": "2015-07-15T21:55:06.083Z",
            "originalError": {
                "code": "ECONNREFUSED",
                "errno": "ECONNREFUSED",
                "syscall": "connect",
                "message": "connect ECONNREFUSED"
            }
        }
    },
    "level": "error",
    "message": "DynamoDB Query failed",
    "timestamp": "2015-07-15T21:55:06.087Z"
}

More recently, this S3 call had this similar error:

...
    "originalError": {
      "message": "Could not load credentials from any providers",
      "code": "CredentialsError",
      "errno": "ECONNREFUSED",
      "syscall": "connect",
      "address": "169.254.169.254",
      "port": 80,
      "time": "2015-08-26T06:08:18.008Z",
      "originalError": {
        "code": "ECONNREFUSED",
        "errno": "ECONNREFUSED",
        "syscall": "connect",
        "address": "169.254.169.254",
        "port": 80,
        "message": "connect ECONNREFUSED 169.254.169.254:80"
      }
    }
...

We've experienced the problem intermittently across multiple applications, but as frequently as half a dozen times per day on a single EC2 instance. We're using the Node.js aws-sdk version 2.1.46 in the example above, with io.js 2.3.1 here and Node.js 0.12.x elsewhere. We're in the ap-southeast-2 region.

While it would appear that the connection is being refused, I'd be surprised to see this endpoint actually go down. Is it possible we're doing something stupid with Node to cause this, or could there be a genuine issue?

@AdityaManohar
Contributor

@davidporter-id-au It looks like the EC2 metadata service is throttling requests from your code. The SDK itself does cache credentials fetched from the metadata service, so multiple simultaneous requests don't bombard the metadata service. See #448

Is your code part of a shell script that is invoked in a loop of some sort? Hitting the metadata service multiple times in succession can cause the requests to be throttled.

@weevilgenius

We've been seeing the same issue when our EC2 instances are under heavy load and the application must make many requests to S3 within a short time.

We're also using IAM roles applied to EC2 instances, and there are no other applications, cron jobs, or scripts other than a single node.js instance which is using the latest AWS SDK (2.1.49).

Sample error message:

TimeoutError: Missing credentials in config
    at ClientRequest.<anonymous> (/opt/cloudio-server/node_modules/aws-sdk/lib/http/node.js:56:34)
    at ClientRequest.g (events.js:260:16)
    at emitNone (events.js:67:13)
    at ClientRequest.emit (events.js:166:7)
    at Socket.emitTimeout (_http_client.js:534:10)
    at Socket.g (events.js:260:16)
    at emitNone (events.js:67:13)
    at Socket.emit (events.js:166:7)
    at Socket._onTimeout (net.js:318:8)
    at Timer.unrefTimeout (timers.js:510:13)

@davidporter-id-au
Author

@AdityaManohar Throttling of the endpoint was my first thought too. Regarding the script being started repeatedly: no, it's a (koajs) webserver, so it starts once and runs indefinitely.

I put a console.log at the point where the SDK appeared to process the metadata request, to see if it was being called multiple times, and observed that the metadata was only fetched on startup, not thereafter.

I also verified this by intentionally creating a worst-case scenario: require()-ing the SDK within a loop, which showed the credentials being fetched each time. That is not what we're seeing in our production app. So I'm certainly not going to rule out us doing something stupid, but I don't think we're hammering the metadata endpoint.

We have since also discovered that a delayed retry appears to resolve the issue. However, this is a kludgy workaround rather than something I'd like to rely on.
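
For anyone else resorting to the same stopgap, a minimal sketch of what we mean is below (the helper, the delay values and the query parameters are illustrative, not our production code):

var AWS = require('aws-sdk');
var dynamodb = new AWS.DynamoDB({ region: 'ap-southeast-2' });

// Retry an SDK call a few times with a fixed delay when it fails with a
// CredentialsError (hypothetical helper, not part of the SDK).
function withCredentialRetry(operation, attempts, delayMs, callback) {
    operation(function(err, data) {
        if (err && err.code === 'CredentialsError' && attempts > 1) {
            return setTimeout(function() {
                withCredentialRetry(operation, attempts - 1, delayMs, callback);
            }, delayMs);
        }
        callback(err, data);
    });
}

// Usage: wrap the DynamoDB query that fails intermittently.
var queryParams = { TableName: 'your table name here' /* plus your key conditions */ };
withCredentialRetry(function(cb) {
    dynamodb.query(queryParams, cb);
}, 3, 2000, function(err, data) {
    if (err) { return console.error(err); }
    console.log(data.Items);
});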

@weevilgenius

I tracked down a detailed error message for my case:

{
  "message": "Missing credentials in config",
  "code": "CredentialsError",
  "time": "Thu Sep 03 2015 17:17:33 GMT+0000 (UTC)",
  "originalError": {
    "message": "Could not load credentials from any providers",
    "code": "CredentialsError",
    "time": "Thu Sep 03 2015 17:17:33 GMT+0000 (UTC)",
    "originalError": {
      "message": "Connection timed out after 1000ms",
      "code": "TimeoutError",
      "time": "Thu Sep 03 2015 17:17:33 GMT+0000 (UTC)"
    }
  }
}

I see we're getting a connection timeout when trying to load credentials instead of a connection refused. That might be a different issue, even though the top level error is the same.

@stemail23

I'm seeing exactly this issue too. It is easily reproduced by having a bunch of processes each run a script that creates a heap of aws-sdk instances and then consumes an API endpoint on them. (I realise this is not a realistic situation, but it allows an intermittent issue to be reproduced reliably.)

The example code that I use to reproduce this on a t2.micro instance is:

var async = require('async');

var handlers = [];

// Each handler requires the SDK afresh, creates a new SQS client, creates a
// queue, and then immediately polls it for messages.
var addHandler = function(value) {
    handlers.push(function(callback) {
        var queueName = 'your queue name here';
        var region = 'ap-southeast-2';
        var createQueueParams = { QueueName: queueName };
        var aws = require('aws-sdk');
        aws.config.region = region;
        var sqs = new aws.SQS();
        sqs.createQueue(createQueueParams, function(err, data) {
            if (err) { return callback(err); }
            var params = {
                QueueUrl: data.QueueUrl,
                MaxNumberOfMessages: 10,
                WaitTimeSeconds: 2
            };
            sqs.receiveMessage(params, callback);
        });
    });
};

// Queue up 500 handlers and run at most 100 of them concurrently.
for (var x = 0; x < 500; x += 1) {
    addHandler(x);
}

async.parallelLimit(handlers, 100, function(err, results) {
    if (err) { return console.error(err); }
    console.log(results.length);
});

If I invoke this code from 10 different node processes simultaneously, then I can pretty much guarantee that the error will be raised (returned in the err on sqs.createQueue)

There is a bigger problem associated with this situation however. I have found that after encountering the issue:

a) The EC2 instance becomes unreliable and typically is pretty much a write-off. Usually I cannot SSH into the machine, and the only recourse has been to terminate (even restart often fails).

b) The biggest issue of all: even though the EC2 instance is effectively dead and unreachable, the EC2 console still reports it as healthy, AND therefore any autoscaler that instantiated the instance is unaware of the failure and does not replace it. In my use case, I'm using an autoscaler group with desired = 1 to ensure failover on my instances. Due to this issue I CANNOT rely on instance monitoring on autoscalers.

It occurs to me that the resolution to this problem ought to be relatively trivial in the aws-sdk (surely just a retry with incremental backoff when retrieving the credentials), but I'm concerned that the EC2 instance issues I'm seeing associated with this problem are symptomatic of a deeper underlying bug in the credentials endpoint code on the instance itself.

@AdityaManohar
Contributor

@stemail23

If I invoke this code from 10 different node processes simultaneously, then I can pretty much guarantee that the error will be raised

If you are spawning multiple Node.js processes you are more likely to be throttled by the EC2 metadata service. The SDK itself will cache credentials after the first fetch. require()-ing the SDK multiple times is going to cause credentials to be fetched multiple times - once for each instance of the SDK.

It looks like some of the other issues that you are having are related to the EC2 instance itself and not the SDK. I would recommend opening up an issue on the Amazon EC2 Forum.

In the meantime, we can definitely look at adding retries and exponential back-off to the EC2 metadata service requests.

@stemail23

Yep, I understand why I see the issue, I built the scenario explicitly to expose it!

The simple facts: it is possible, in fact inevitable, using only AWS products (EC2 and the SDK), to bring an EC2 instance to its knees. The exact steps to reproduce the situation are outlined above. What's frustrating to me, as a customer, is the difficulty I'm having raising this as a bug report. I assumed there would be an internal process to route it to the appropriate place, but instead I keep getting redirected myself.

@davidporter-id-au
Author

@stemail23 Just as an aside, the health check behaviour you're seeing is - I think - expected. You need to switch your auto scaling group to use ELB health checks rather than the default EC2 health check, which I believe just uses the hypervisor's system checks and is not aware of any network effects. Once you've done so, the scaling group will rely on the HTTP or TCP health check and kill and scale accordingly. I'm no expert on the subject, but I recall my team having to make that explicit fix.
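
For reference, something along these lines should do it with the JS SDK (the group name is a placeholder, and the UpdateAutoScalingGroup parameters are my best recollection of the API, so double-check them):

var AWS = require('aws-sdk');
var autoscaling = new AWS.AutoScaling({ region: 'ap-southeast-2' });

// Switch the group from the default EC2 status checks to the ELB health check.
autoscaling.updateAutoScalingGroup({
    AutoScalingGroupName: 'your scaling group name here',
    HealthCheckType: 'ELB',
    HealthCheckGracePeriod: 300 // seconds to wait before the first ELB check counts
}, function(err) {
    if (err) { return console.error(err); }
    console.log('Health check type switched to ELB');
});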

@AdityaManohar We have had some success in addressing the issue with crude retries. If the SDK were able to do this without intervention, while also handling backoff, that would be good.

@seriousben

We've also been hit by this today intermittently on code that was working fine before...

@stemail23

@davidporter-id-au

Thanks for the suggestion. Unfortunately, in my case, I don't have an ELB in the equation on these instances (they're job handler machines lifting messages from SQS). I'm exploring other options where a monitor machine attempts to recognise dead instances and terminate them, but it's frustrating to have to expend this effort!

@stemail23

We've also been hit by this today intermittently on code that was working fine before...

Exactly, which is why I suspect that some change in EC2 is complicit in the situation, rather than being solely an AWS-SDK issue.

@davidporter-id-au
Author

I suppose there are two issues: the single point of failure this reveals in the SDK for this kind of authentication, and the probable infrastructure problem we're seeing where the metadata endpoint is subject to transient failure. For the latter I had created a support ticket, but let it expire. I'll follow that up.

@stemail23 @seriousben I notice you're responding when I am. Are you ap-southeast-2 for reference?

@stemail23

@davidporter-id-au Yes, I'm in Sydney

@seriousben

@davidporter-id-au - us-east for us.

@AdityaManohar
Contributor

@davidporter-id-au @stemail23 @seriousben
You can try increasing the timeout of the AWS.EC2MetadataCredentials provider by setting the httpOptions.timeout option. This defaults to 1000 ms.

var AWS = require('aws-sdk');
AWS.config.credentials = new AWS.EC2MetadataCredentials({
  httpOptions: { timeout: 4000 }
});

This should help alleviate some of the issues with a slow responding metadata service.

@areichman

We started seeing the issue this week as well. For us, it happened when we updated our Node install from version 0.10.17 to 4.2.2. Our process runs from a crontab every 15 minutes and sends about 20K messages to SQS. With 0.10.17, we ran with no issues; within 30 minutes of updating to 4.2.2 we started seeing the intermittent issues. In both cases, we had the same 2.2.18 version of the SDK.

A similar issue was discussed here in the past: #445

@willwhite and @mick, have you seen any similar issues since your update was added to the SDK?

@zbjornson

This started happening for us recently. Sporadically when uploading to S3 from the nodejs SDK (v2.2.11 and 2.2.33) we would get the same error posted in #692 (comment). Increasing the timeout to 4000 ms didn't fix it; increasing it to 10000 ms did.

We're also not hammering the endpoint (in fact our test server was making a single request at a time) -- it seems like the metadata endpoint is simply slow to respond, given that increasing the timeout alleviates it.

@rfink

rfink commented Apr 11, 2016

Also having this issue +1

@seriousben

We fixed this by using only one instance of the SDK.
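
Roughly, the pattern is to configure the SDK once in a shared module and require the same client objects everywhere (the file name, region and clients below are just illustrative):

// aws-clients.js -- configure the SDK once and export singleton clients so
// every caller reuses the same (cached) credentials object.
var AWS = require('aws-sdk');
AWS.config.update({ region: 'us-east-1' });

module.exports = {
    s3: new AWS.S3(),
    sqs: new AWS.SQS()
};

// elsewhere in the app:
// var s3 = require('./aws-clients').s3;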

@bbarney

bbarney commented May 6, 2016

We are seeing this too. It can't be a throttling issue: it's on a staging instance that is only hit a few times per hour. Additionally, it happens at application startup, so the server never starts.

@Glavin001

Glavin001 commented May 11, 2016

Also experiencing this issue. I tried increasing the timeout to 10 seconds to no avail.

@bbarney have you found any workarounds? I am experiencing the same issue on startup, every single time.

@rogerwaldvogel

Also having this issue

@samuelsensei

Having the same problem here. Problem comes and goes. Especially happens when I register a new user and log them in.

@dparmar74

Having the same issue. Increasing timeout didn't help

@ApsOps

ApsOps commented Jun 23, 2016

Same issue while using SQS for us. We're using a single instance of SDK object.

@rfink

rfink commented Jun 28, 2016

Same here, single instance of SDK, still problems.

@juanstiza

Same here, it is happening with S3... Strangely enough, it works on Ubuntu but not on Mac; I'll have to check my network settings.

@codan84

codan84 commented Aug 17, 2016

We have an application running as a cron job executed every 10s on an EC2 instance, and we see this issue very frequently. Since the application only runs for about 3-4s every 10s, we require() the aws-sdk each time the app starts. Is there any way around this issue for a scenario like this?

@seriousben

seriousben commented Aug 18, 2016

You could change your code to start the loop only after AWS.config.getCredentials(cb) finishes. Otherwise you fire async S3 operations at the same time and they all think (and they are right) that they need to fetch credentials.
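
Something along these lines (startPollingLoop is just a stand-in for whatever the job actually does):

var AWS = require('aws-sdk');

// Resolve credentials once, up front; only kick off the work after the
// callback fires so concurrent operations reuse the already-cached credentials.
AWS.config.getCredentials(function(err) {
    if (err) { return console.error('Could not load credentials', err); }
    startPollingLoop(); // stand-in for the job's real entry point
});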

@codan84

codan84 commented Aug 18, 2016

Indeed, that's essentially what I did. I have updated my comment above with the solution.

@seriousben

Making sure AWS.config.getCredentials is called before any AWS operations, AND making sure to reuse the same AWS.S3 instance everywhere, will get rid of this problem for good.

@rfink

rfink commented Aug 18, 2016

Oddly enough, this is still happening to me even when specifying the credentials in environment variables.

@dnorth98

We're seeing this too now on apps running in Node under ECS (using role credentials of course). Are there any signs this will be fixed in the future?

@LiuJoyceC
Contributor

Hi,

We are actively still looking at this issue and appreciate your patience.

@dnorth98 Is the error you're getting when running on ECS the same "missing credentials in config" error and is it also intermittent? Can you confirm that the SDK is hitting the ECS credential endpoint rather than hitting the EC2 Metadata service? Thanks

@dnorth98

@LiuJoyceC We are getting the same credentials error (it's actually when making a dynamoDB call)

Missing credentials in config. It's highly intermittent (maybe 1 call in 1000) but noticeable enough to cause errors in the client service.

Regarding how we get the credentials, we're not hitting the metadata service directly. We're just initializing the DynamoDB client without passing in specific credentials (i.e. using the role creds).

@LiuJoyceC
Contributor

Hi @codan84
How are you verifying that the SDK is hitting the EC2 Metadata Service endpoint each time you call s3.upload()? Are you checking the number of instances of EC2MetadataCredentials that are created, or are you actually checking requests made to the endpoint? When I tried to reproduce the error with the code you provided above, 200 instances of EC2MetadataCredentials were indeed instantiated, but only one of them actually made a request to the Metadata Service. You can verify this by mocking (or logging something inside) the request() function on the MetadataService class (lib/metadata_service.js) and running your example code again.

The reason for this is the implementation of the loadCredentials function in the MetadataService class. A queue of callbacks is maintained, and as long as the queue holds more than one entry, no new request to the Metadata Service is made. That means only one request can be in flight at a time, and when the response comes back, all of the callbacks in the queue are called, so they don't each need to make a request.
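
In other words, something like this simplified sketch (not the SDK's actual source, just the shape of the behaviour; requestMetadataService stands in for the real HTTP call):

// Simplified sketch of the behaviour described above (not the SDK's source).
var pendingCallbacks = [];

function loadCredentials(callback) {
    pendingCallbacks.push(callback);
    // If someone is already waiting, a request is in flight -- just queue up.
    if (pendingCallbacks.length > 1) { return; }
    requestMetadataService(function(err, credentials) { // stand-in for the HTTP call
        var waiting = pendingCallbacks;
        pendingCallbacks = [];
        waiting.forEach(function(cb) { cb(err, credentials); });
    });
}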

@stemail23
I wasn't able to reproduce the error with the code snippet you provided (on Sept 6) either, for the same reason as above. Although the code is making hundreds of requests to SQS, they don't each make a request to the EC2 Metadata Service, and no matter how much I fiddled with the number of iterations of the for loop or the parallel limit, I couldn't get the Metadata Service to return an error (I was also running on a t2.micro instance). (And it doesn't make a difference how many times you call require('aws-sdk'): require() caches module exports, and subsequent calls simply retrieve from the cache, so there aren't multiple aws-sdk instances being created.)

That said, even though only one request to the Metadata Service can be in flight at a time, it is possible to hit the Metadata Service too many times in a short time span (you can keep hitting it as soon as the previous response comes back). Given that this error is intermittent ( @dnorth98 mentioned that it happens about 1 out of 1000 times), implementing retries with exponential backoff would likely resolve the problem, as it is unlikely that the 2nd or 3rd try will get the error again. I am actively working on that now and will provide an update when it is finished. Since it was reported above that this problem has also occurred on ECS, I can also implement the exponential backoff in the ECSCredentials provider.
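
To illustrate the intent, the retry logic would look something like this (illustrative only, not the actual PR):

// Illustrative only: retry a credential fetch with exponential backoff
// (delay = base * 2^attempt milliseconds between attempts).
function fetchWithBackoff(fetchCredentials, maxRetries, baseDelayMs, callback, attempt) {
    attempt = attempt || 0;
    fetchCredentials(function(err, credentials) {
        if (!err || attempt >= maxRetries) { return callback(err, credentials); }
        setTimeout(function() {
            fetchWithBackoff(fetchCredentials, maxRetries, baseDelayMs, callback, attempt + 1);
        }, baseDelayMs * Math.pow(2, attempt));
    });
}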

@stemail23

@LiuJoyceC Thanks for the feedback. I was able to reliably reproduce the issue with the code I provided, but I admit I haven't looked into it since, so it's possible that things have changed. I notice, though, that you don't mention running multiple processes, so perhaps that indicates why you couldn't reproduce? To reproduce the issue I needed to run the provided script in up to ten processes concurrently.

Thanks for looking into the issue. Hopefully you'll have some success with backed off retries, and hopefully the suggestions above might help you test a fix if you can reproduce the problem.

Cheers!

@LiuJoyceC
Contributor

Hi,

The PR for retrying EC2MetadataCredentials and ECSCredentials has been merged to master, so you can try it out now by cloning the repo, or you can wait for the next release of the SDK in NPM. By default it times out after 1000ms and retries up to 3 times with a base delay of 100ms. If you still get intermittent timeout errors even with this default retry behavior, you can try increasing the timeout, the max retries, and the retry delay:

AWS.config.credentials = new AWS.EC2MetadataCredentials({
    httpOptions: { timeout: 5000 },
    maxRetries: 10,
    retryDelayOptions: { base: 200 }
});

If that still doesn't work, please let me know!

@stemail23

Thanks @LiuJoyceC

@rfink

rfink commented Oct 24, 2016

So this is still happening in v2.6.9 on an EC2 instance (utilizing Elastic Beanstalk).

{"message":"Missing credentials in config","name":"CredentialsError","stack":"Error: connect ECONNREFUSED 169.254.169.254:80\n at Object.exports._errnoException (util.js:874:11)\n at exports._exceptionWithHostPort (util.js:897:20)\n at TCPConnectWrap.afterConnect as oncomplete","code":"CredentialsError"}

@ckknight

Ran into this issue locally - was due to some shenanigans with process.env.

The fix was to manually pass accessKeyId and secretAccessKey to aws.config.update(...).
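
Roughly (reading the keys from environment variables here purely as an example):

var AWS = require('aws-sdk');

// Supply the keys explicitly instead of relying on the default provider chain.
AWS.config.update({
    accessKeyId: process.env.AWS_ACCESS_KEY_ID,
    secretAccessKey: process.env.AWS_SECRET_ACCESS_KEY
});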

@JoeMcGuire

I just hit this issue on AWS ECS (Elastic Container Service), which requires ECSCredentials instead of EC2MetadataCredentials.

AWS.config.credentials = new AWS.ECSCredentials({
  httpOptions: { timeout: 5000 },
  maxRetries: 10,
  retryDelayOptions: { base: 200 }
})

@pe8ter

pe8ter commented Jun 5, 2018

@LiuJoyceC Should this credentials timeout configuration be created once per require of the AWS SDK, or once globally for an entire application?

@rfink

rfink commented Nov 22, 2018

Still happening for me in ECS with aws-sdk version 2.270.1 and node.js version 10.11

@lock

lock bot commented Sep 28, 2019

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs and link to relevant comments in this thread.

@lock lock bot locked as resolved and limited conversation to collaborators Sep 28, 2019