DRIVERS-2333 Cache AWS Credentials Where Possible #1281

blink1073 · 2022-07-27T20:13:20Z

Please complete the following before merging:

Bump spec version and last modified date.
Update changelog.
Make sure there are generated JSON files from the YAML test files. - N/A
Test changes in at least one language driver.
- [DRIVERS-2333] Use helper functions for cached credentials pymongo-auth-aws#10
- PYTHON-3313 Cache AWS Credentials Where Possible mongo-python-driver#982
Test these changes against all server versions and topologies (including standalone, replica set, sharded clusters, and serverless). - N/A

JamesKovacs

Some questions and clarifications requested.

JamesKovacs · 2022-08-05T21:50:29Z

source/auth/auth.rst

+___________________
+Credentials fetched by the driver using AWS endpoints MUST be cached and reused
+to avoid hitting rate limitations.  The "Expiration" field must be stored and
+used to determine when to clear the cache.  If the "Expiration" is within 5


How was 5 minutes selected? Why not fetch new credentials when the previous credentials expire and cache those new credentials? Why do we want to expire early?

What is the typical TTL of newly fetched credentials? Should the expiration be X% of the TTL with a maximum? e.g. Let's say we choose 5% with a maximum of 5 minutes, then a 1 hour TTL would be refreshed with 3 minutes remaining whereas a 2 hour TTL would be capped at a 5 minute refresh.

Who is responsible for clearing and refreshing the cache? A background process/thread? The next connection to find the cache empty? If the former, what should a new connection do when it needs credentials but finds the cache empty? Block? Trigger a refresh event? If the latter, how do we prevent multiple new connections from fetching fresh credentials and populating the cache?

5 minutes was selected based on the AWS docs: "We make new credentials available at least five minutes before the expiration of the old credentials". The main intent was to have some buffer between when the driver fetches the credentials and the server verifies them. For EC2 credentials, there is no defined TTL, "These security credentials are temporary and we rotate them automatically". We could base it on a percentage if that is preferable.

As for who is responsible for clearing and refreshing the cache I think that would be driver dependent. For a non-blocking driver they should use whatever standard locking mechanisms make sense for the language. For example in Python we can use a threading.Lock to access the global cache. IMO we should state something like: Since the cache is intended to be global, it is up to the driver to ensure that the cache is refreshed by only one process/thread at a time. I don't think the usage of a cache should change the normal behavior of auth fetch, since you'd either block or release control of a thread/coroutine to make the network call previously.

Thanks for the explanation. I would include a brief explanation of the reasoning and a link to the AWS docs as others will likely have the same question in the future. This will be especially true if Amazon changes this 5-minute refresh design decision in the future. Since the 5-minute refresh is documented by Amazon, I'm good with using this absolute value rather than a relative one.

As for the responsibility of clearing and refreshing the cache, I agree that the exact implementation is going to be driver dependent. What I think we have to specify is the expectations. Should the refresh block the creation of new connections? Or should those new connections use whatever the latest cached credentials are? I would argue the latter as I'd rather use credentials that are about to expire in 5 minutes (or less if the refresh is stuck) than block connection requests because of a token refresh problem.

Great, I'll update to give more detail and the link.

I think there are two possibilities for cache refresh: there is a background task in charge of pre-emptively fetching credentials before the 5 minute window, or the credentials are checked and fetched on demand (as is currently implemented in the Python driver). I think the 5 minutes is irrelevant in the case of checking and fetching on demand, because it is more likely that you're checking the cached credentials after they have already expired, rather than within the 5 minute expiration window. If we allowed the ones in the 5 minute window to be used, we'd end up having to specify another "grace" period, or else risk a race condition with the server on credential expiration. What I'm thinking is the following wording:

Cached credentials can be monitored and refreshed in a background task or fetched on demand during authorization. If credentials are fetched on demand, the driver MUST include appropriate locking mechanisms to prevent simultaneous fetches. Drivers MUST block or yield execution until credentials are available from the single cache provider, regardless of implementation.

Although we can leave the implementations details of how to refresh the credentials to each individual driver (e.g. background thread, task, etc.), we need to specify the desired refresh behaviour more specifically. For example, what happens if the credentials endpoint is slow to respond? We don't want to block all connections attempting to authenticate on a lock. We have credentials that are still good for another 5 minutes. Why not use the existing credentials to authenticate the connections and hope that the credentials endpoint will have recovered before the credentials expire. Once the existing credentials expire, there is not much that we can do about it. So at that point it is reasonable to return auth failures and log some warnings about the credentials endpoint being slow to respond.

My concern is that we specify the refresh behaviour so that we fail gracefully and don't end up with multiple threads/connections all blocked on a lock waiting for a slow HTTP endpoint to respond.

I am still confused. What we have today isn't a problem of slow responses from AWS endpoints, it is rate limiting when there are a bunch of connections created at once. I don't think we need to over-index on whether we have to wait for credentials at all. I think it is perfectly reasonable to fetch them the same way we currently do it (but only one at a time now that we have a lock), on the off chance that we are within the 5 minute expiration window.

I totally agree with the goal of reducing the number of requests to the AWS endpoint used to retrieve credentials to avoid hitting rate limits. But let's imagine a scenario where the endpoint is temporarily slow to respond for whatever reason. We are going from a situation where one connection (out of many) will take longer than expected to authenticate to a situation where we've blocked all new connection creation until that one slow request completes. That's what I'm trying to avoid.

Why am I concerned about this? Because when we implemented maxConnecting with a fixed value of 2, we had users complain that they wanted it tuneable. If we introduce this lock, we are effectively reducing maxConnecting to 1 whenever the AWS credentials need refreshing.

Hopefully that provides some more context on why we need to be careful about the refresh semantics.

I still am not seeing a way around blocking at some point if we do not have background tasks available on the driver, given the desire to only have one AWS request in flight at a time. Whether this happens within the 5 minute window or after, incoming requests would be blocked until the cache was ready.

Perhaps if we just remove the lock on the cache, then everything is okay. Absolute worst case is that we have the current condition of several attempted requests during the 5 minute window, and the user gets rate limited as they do today. But in most cases, we either use the cache or there is a single request to update the cache at any one time. Then there is no new unwanted behavior, and the behavior is in general much better.

We could implement this non-blocking behaviour - whether we use a background thread, a task, or some other language-specific mechanism - via a compare exchange operation, a reader-write lock, or some other synchronization primitive. For example in C#:

private SomeCredential _currentCredential; public SomeCredential CurrentCredential { get => Interlocked.CompareExchange(ref _currentCredential, null, null); set => Interlocked.Exchange(ref _currentCredential, value); }

This would allow you to read the current credentials at any point while allowing another thread to update them safely. You would need some other mechanism to ensure only one thread is updating at a time. This could be a dedicated maintenance thread, a task scheduled to execute on a thread pool, or setting a variable (via Interlocked or similar) to indicate that the credentials are being updated. This would allow one credential update operation at a time while other threads can still access the cached value.

blink1073 · 2022-08-19T21:01:52Z

@JamesKovacs and I spoke today to talk through the problem. Based on our reading of the spec, maxConnecting encompasses the authentication phase, which means we can allow unbounded requests for cache refresh and they will be effectively limited by maxConnecting. Writes to the cache must be atomic, and last one writing wins.

If the cache is still valid, the driver should attempt to use it immediately. If there is no cache or the cache has expired, we need to block on a request to update the cache. Drivers are encouraged to update the cache proactively in a background task before it expires so that is available for connections without delay. There is a small risk that valid credentials expire before giving them to the server, in which case a reconnect will need to be attempted by the user.

I will update the PR on Monday.

jyemin · 2022-08-19T21:05:51Z

we can allow unbounded requests for cache refresh and they will be effectively limited by maxConnecting.

More precisely, limited by maxConnecting * number-of-servers-under-active-selection

DmitryLukyanov · 2022-10-07T16:50:39Z

source/auth/auth.rst

+If there are no current valid cached credentials, the driver MUST initiate a
+credential request.  To avoid adding a bottleneck that would override the
+``maxConnecting`` setting, the driver MUST not place a lock on making a
+request.  The cache MUST be written atomically.


nit: double space

not fixed :)

Fixed again. :)

JamesKovacs

Minor formatting issues.

JamesKovacs · 2022-10-07T21:25:39Z

source/auth/auth.rst

+<<<<<<< HEAD
+=======
+:Last Modified: 2022-07-27
+>>>>>>> a1756248 (update changelog and version)


Fix merge conflicts. Much of the front matter was removed by https://jira.mongodb.org/browse/DRIVERS-2253.

JamesKovacs · 2022-10-07T21:28:14Z

source/auth/auth.rst

+
+If using manual caching, the "Expiration" field MUST be stored
+and used to determine when to clear the cache. Credentials are considered
+valid if they are more than five minutes away from expiring, to the reduce the


Comma splice.

JamesKovacs · 2022-10-07T21:28:57Z

source/auth/auth.rst

+If there are no current valid cached credentials, the driver MUST initiate a
+credential request. To avoid adding a bottleneck that would override the
+``maxConnecting`` setting, the driver MUST not place a lock on making a
+request.  he cache MUST be written atomically.


he cache MUST

The cache MUST

JamesKovacs · 2022-10-07T21:29:31Z

source/auth/auth.rst

+If AWS authentication fails for any reason, the cache MUST be cleared.
+
+.. note::
+    Five minutes was chosen because based on the AWS documentation for `IAM roles for EC2 <https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/iam-roles-for-amazon-ec2.html>`_ : "We make new credentials available at least five minutes before the expiration of the old credentials". The intent is to have some buffer between when the driver fetches the credentials and when the server verifies them.


because based on

based on

JamesKovacs · 2022-10-07T21:31:11Z

source/auth/tests/mongodb-aws.rst

+
+#. Clear the cache.
+#. Create a new client.
+#. Ensure that a ``find`` operation adds credentials to the cache..


Double period at end of sentence.

JamesKovacs · 2022-10-07T21:33:12Z

source/auth/tests/mongodb-aws.rst

+   minute of the current UTC time.
+#. Create a new client.
+#. Ensure that a ``find`` operation updates the credentials in the cache.
+#. Poison the cache with invalid auth content.


Is it worth describing in more detail how to poison the cache?

The updated language was Poison the cache with an invalid access key id., it just doesn't show up in this UI.

DmitryLukyanov

LGTM, assuming James's comments are resolved

blink1073 requested a review from a team as a code owner July 27, 2022 20:13

blink1073 requested review from JamesKovacs and removed request for a team July 27, 2022 20:13

JamesKovacs requested changes Aug 5, 2022

View reviewed changes

blink1073 mentioned this pull request Oct 7, 2022

CSHARP-4273: Cache AWS Credentials Where Possible. mongodb/mongo-csharp-driver#894

Merged

DRIVERS-2333 Cache AWS Credentials Where Possible

54bdcf6

DmitryLukyanov reviewed Oct 7, 2022

View reviewed changes

blink1073 added 6 commits October 7, 2022 13:36

update changelog and version

7282056

Add note about the five minute window

a73b5d3

address review

00f1985

update language to encourage SDK use

f5e7b71

Update to use 5 minutes and no background task

ce365eb

address review

df74cbe

blink1073 force-pushed the DRIVERS-2333 branch from 20a9b17 to df74cbe Compare October 7, 2022 18:38

blink1073 requested review from JamesKovacs and DmitryLukyanov October 7, 2022 18:38

JamesKovacs requested changes Oct 7, 2022

View reviewed changes

DmitryLukyanov approved these changes Oct 9, 2022

View reviewed changes

address review

8f0fb92

blink1073 requested a review from JamesKovacs October 12, 2022 15:18

fix double space

331cf49

blink1073 merged commit 364761d into mongodb:master Oct 18, 2022

blink1073 deleted the DRIVERS-2333 branch October 18, 2022 14:38

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

DRIVERS-2333 Cache AWS Credentials Where Possible #1281

DRIVERS-2333 Cache AWS Credentials Where Possible #1281

blink1073 commented Jul 27, 2022

JamesKovacs left a comment

JamesKovacs Aug 5, 2022

blink1073 Aug 8, 2022

JamesKovacs Aug 8, 2022

blink1073 Aug 10, 2022

JamesKovacs Aug 11, 2022

blink1073 Aug 12, 2022

JamesKovacs Aug 15, 2022

blink1073 Aug 18, 2022

JamesKovacs Aug 18, 2022

blink1073 commented Aug 19, 2022

jyemin commented Aug 19, 2022

DmitryLukyanov Oct 7, 2022

blink1073 Oct 7, 2022

DmitryLukyanov Oct 17, 2022

blink1073 Oct 17, 2022

JamesKovacs left a comment

JamesKovacs Oct 7, 2022

blink1073 Oct 10, 2022

JamesKovacs Oct 7, 2022

blink1073 Oct 10, 2022

JamesKovacs Oct 7, 2022

blink1073 Oct 10, 2022

JamesKovacs Oct 7, 2022

blink1073 Oct 10, 2022

JamesKovacs Oct 7, 2022

blink1073 Oct 10, 2022

JamesKovacs Oct 7, 2022

blink1073 Oct 10, 2022

DmitryLukyanov Oct 17, 2022

blink1073 Oct 17, 2022

DmitryLukyanov left a comment

DRIVERS-2333 Cache AWS Credentials Where Possible #1281

DRIVERS-2333 Cache AWS Credentials Where Possible #1281

Conversation

blink1073 commented Jul 27, 2022

JamesKovacs left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

blink1073 commented Aug 19, 2022

jyemin commented Aug 19, 2022

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

JamesKovacs left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

DmitryLukyanov left a comment

Choose a reason for hiding this comment