Skip to content

DRIVERS-2333 Cache AWS Credentials Where Possible #1281

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 9 commits into from
Oct 18, 2022

Conversation

blink1073
Copy link
Member

Please complete the following before merging:

@blink1073 blink1073 requested a review from a team as a code owner July 27, 2022 20:13
@blink1073 blink1073 requested review from JamesKovacs and removed request for a team July 27, 2022 20:13
Copy link
Contributor

@JamesKovacs JamesKovacs left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some questions and clarifications requested.

___________________
Credentials fetched by the driver using AWS endpoints MUST be cached and reused
to avoid hitting rate limitations. The "Expiration" field must be stored and
used to determine when to clear the cache. If the "Expiration" is within 5
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How was 5 minutes selected? Why not fetch new credentials when the previous credentials expire and cache those new credentials? Why do we want to expire early?

What is the typical TTL of newly fetched credentials? Should the expiration be X% of the TTL with a maximum? e.g. Let's say we choose 5% with a maximum of 5 minutes, then a 1 hour TTL would be refreshed with 3 minutes remaining whereas a 2 hour TTL would be capped at a 5 minute refresh.

Who is responsible for clearing and refreshing the cache? A background process/thread? The next connection to find the cache empty? If the former, what should a new connection do when it needs credentials but finds the cache empty? Block? Trigger a refresh event? If the latter, how do we prevent multiple new connections from fetching fresh credentials and populating the cache?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

5 minutes was selected based on the AWS docs: "We make new credentials available at least five minutes before the expiration of the old credentials". The main intent was to have some buffer between when the driver fetches the credentials and the server verifies them. For EC2 credentials, there is no defined TTL, "These security credentials are temporary and we rotate them automatically". We could base it on a percentage if that is preferable.

As for who is responsible for clearing and refreshing the cache I think that would be driver dependent. For a non-blocking driver they should use whatever standard locking mechanisms make sense for the language. For example in Python we can use a threading.Lock to access the global cache. IMO we should state something like: Since the cache is intended to be global, it is up to the driver to ensure that the cache is refreshed by only one process/thread at a time. I don't think the usage of a cache should change the normal behavior of auth fetch, since you'd either block or release control of a thread/coroutine to make the network call previously.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the explanation. I would include a brief explanation of the reasoning and a link to the AWS docs as others will likely have the same question in the future. This will be especially true if Amazon changes this 5-minute refresh design decision in the future. Since the 5-minute refresh is documented by Amazon, I'm good with using this absolute value rather than a relative one.

As for the responsibility of clearing and refreshing the cache, I agree that the exact implementation is going to be driver dependent. What I think we have to specify is the expectations. Should the refresh block the creation of new connections? Or should those new connections use whatever the latest cached credentials are? I would argue the latter as I'd rather use credentials that are about to expire in 5 minutes (or less if the refresh is stuck) than block connection requests because of a token refresh problem.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great, I'll update to give more detail and the link.

I think there are two possibilities for cache refresh: there is a background task in charge of pre-emptively fetching credentials before the 5 minute window, or the credentials are checked and fetched on demand (as is currently implemented in the Python driver). I think the 5 minutes is irrelevant in the case of checking and fetching on demand, because it is more likely that you're checking the cached credentials after they have already expired, rather than within the 5 minute expiration window. If we allowed the ones in the 5 minute window to be used, we'd end up having to specify another "grace" period, or else risk a race condition with the server on credential expiration. What I'm thinking is the following wording:

  • Cached credentials can be monitored and refreshed in a background task or fetched on demand during authorization. If credentials are fetched on demand, the driver MUST include appropriate locking mechanisms to prevent simultaneous fetches. Drivers MUST block or yield execution until credentials are available from the single cache provider, regardless of implementation.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Although we can leave the implementations details of how to refresh the credentials to each individual driver (e.g. background thread, task, etc.), we need to specify the desired refresh behaviour more specifically. For example, what happens if the credentials endpoint is slow to respond? We don't want to block all connections attempting to authenticate on a lock. We have credentials that are still good for another 5 minutes. Why not use the existing credentials to authenticate the connections and hope that the credentials endpoint will have recovered before the credentials expire. Once the existing credentials expire, there is not much that we can do about it. So at that point it is reasonable to return auth failures and log some warnings about the credentials endpoint being slow to respond.

My concern is that we specify the refresh behaviour so that we fail gracefully and don't end up with multiple threads/connections all blocked on a lock waiting for a slow HTTP endpoint to respond.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am still confused. What we have today isn't a problem of slow responses from AWS endpoints, it is rate limiting when there are a bunch of connections created at once. I don't think we need to over-index on whether we have to wait for credentials at all. I think it is perfectly reasonable to fetch them the same way we currently do it (but only one at a time now that we have a lock), on the off chance that we are within the 5 minute expiration window.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I totally agree with the goal of reducing the number of requests to the AWS endpoint used to retrieve credentials to avoid hitting rate limits. But let's imagine a scenario where the endpoint is temporarily slow to respond for whatever reason. We are going from a situation where one connection (out of many) will take longer than expected to authenticate to a situation where we've blocked all new connection creation until that one slow request completes. That's what I'm trying to avoid.

Why am I concerned about this? Because when we implemented maxConnecting with a fixed value of 2, we had users complain that they wanted it tuneable. If we introduce this lock, we are effectively reducing maxConnecting to 1 whenever the AWS credentials need refreshing.

Hopefully that provides some more context on why we need to be careful about the refresh semantics.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I still am not seeing a way around blocking at some point if we do not have background tasks available on the driver, given the desire to only have one AWS request in flight at a time. Whether this happens within the 5 minute window or after, incoming requests would be blocked until the cache was ready.

Perhaps if we just remove the lock on the cache, then everything is okay. Absolute worst case is that we have the current condition of several attempted requests during the 5 minute window, and the user gets rate limited as they do today. But in most cases, we either use the cache or there is a single request to update the cache at any one time. Then there is no new unwanted behavior, and the behavior is in general much better.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could implement this non-blocking behaviour - whether we use a background thread, a task, or some other language-specific mechanism - via a compare exchange operation, a reader-write lock, or some other synchronization primitive. For example in C#:

private SomeCredential _currentCredential;
                                                                            
public SomeCredential CurrentCredential
{
    get => Interlocked.CompareExchange(ref _currentCredential, null, null);
    set => Interlocked.Exchange(ref _currentCredential, value);
}

This would allow you to read the current credentials at any point while allowing another thread to update them safely. You would need some other mechanism to ensure only one thread is updating at a time. This could be a dedicated maintenance thread, a task scheduled to execute on a thread pool, or setting a variable (via Interlocked or similar) to indicate that the credentials are being updated. This would allow one credential update operation at a time while other threads can still access the cached value.

@blink1073
Copy link
Member Author

@JamesKovacs and I spoke today to talk through the problem. Based on our reading of the spec, maxConnecting encompasses the authentication phase, which means we can allow unbounded requests for cache refresh and they will be effectively limited by maxConnecting. Writes to the cache must be atomic, and last one writing wins.

If the cache is still valid, the driver should attempt to use it immediately. If there is no cache or the cache has expired, we need to block on a request to update the cache. Drivers are encouraged to update the cache proactively in a background task before it expires so that is available for connections without delay. There is a small risk that valid credentials expire before giving them to the server, in which case a reconnect will need to be attempted by the user.

I will update the PR on Monday.

@jyemin
Copy link
Contributor

jyemin commented Aug 19, 2022

we can allow unbounded requests for cache refresh and they will be effectively limited by maxConnecting.

More precisely, limited by maxConnecting * number-of-servers-under-active-selection

If there are no current valid cached credentials, the driver MUST initiate a
credential request. To avoid adding a bottleneck that would override the
``maxConnecting`` setting, the driver MUST not place a lock on making a
request. The cache MUST be written atomically.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: double space

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not fixed :)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed again. :)

Copy link
Contributor

@JamesKovacs JamesKovacs left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor formatting issues.

<<<<<<< HEAD
=======
:Last Modified: 2022-07-27
>>>>>>> a1756248 (update changelog and version)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fix merge conflicts. Much of the front matter was removed by https://jira.mongodb.org/browse/DRIVERS-2253.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed


If using manual caching, the "Expiration" field MUST be stored
and used to determine when to clear the cache. Credentials are considered
valid if they are more than five minutes away from expiring, to the reduce the
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Comma splice.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed

If there are no current valid cached credentials, the driver MUST initiate a
credential request. To avoid adding a bottleneck that would override the
``maxConnecting`` setting, the driver MUST not place a lock on making a
request. he cache MUST be written atomically.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

he cache MUST

The cache MUST

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed

If AWS authentication fails for any reason, the cache MUST be cleared.

.. note::
Five minutes was chosen because based on the AWS documentation for `IAM roles for EC2 <https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/iam-roles-for-amazon-ec2.html>`_ : "We make new credentials available at least five minutes before the expiration of the old credentials". The intent is to have some buffer between when the driver fetches the credentials and when the server verifies them.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

because based on

based on

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed


#. Clear the cache.
#. Create a new client.
#. Ensure that a ``find`` operation adds credentials to the cache..
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Double period at end of sentence.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed

minute of the current UTC time.
#. Create a new client.
#. Ensure that a ``find`` operation updates the credentials in the cache.
#. Poison the cache with invalid auth content.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it worth describing in more detail how to poison the cache?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep, done

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not done?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The updated language was Poison the cache with an invalid access key id., it just doesn't show up in this UI.

Copy link

@DmitryLukyanov DmitryLukyanov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, assuming James's comments are resolved

@blink1073 blink1073 requested a review from JamesKovacs October 12, 2022 15:18
@blink1073 blink1073 merged commit 364761d into mongodb:master Oct 18, 2022
@blink1073 blink1073 deleted the DRIVERS-2333 branch October 18, 2022 14:38
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants