-
Notifications
You must be signed in to change notification settings - Fork 203
Introduce token cache and use it for GitHub App tokens #1745
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
@@ -677,6 +683,7 @@ func (r *GitRepositoryReconciler) getAuthOpts(ctx context.Context, obj *sourcev1 | |||
Name: sourcev1.GitProviderGitHub, | |||
GitHubOpts: []github.OptFunc{ | |||
github.WithAppData(authData), | |||
github.WithCache(r.tokenCache, sourcev1.GitRepositoryKind, obj.GetName(), obj.GetNamespace()), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This pkg PR introduces cache in the auth/github pkg and caches the token once it is fetched. In scenarios where the token is invalid (for example: due to incorrectly permissions configured by the user), the git operations would fail with an error (similar to Unauthorized). The token needs to be invalidated from the cache in this case.
When we had attempted to add the cache as part of enabling Azure OIDC for gitrepository, we had discussed in the PR comment and the dev meetings to add the cache in the respective clients instead where the token is actually used. For example, for git operations, the cache would be added to git client. During clone operations, cached token would be fetched from the cache first. In case of a cache HIT, it would be used to do the clone. If the clone operation fails and the token is no longer valid it would be invalidated from the cache. In case of a cache miss or invalid token, a new token would be fetched using auth/github GetToken API. This token would be cached after successful clone operation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Scenario 1: Permission errors
I just did a simple test here with our implementation for flux-operator, we have this code running in a cluster for a few days.
This was my experiment:
- I revoked the GitHub App permission to read pull requests.
- The object reconciliation started showing this error:
failed to call provider failed to list requests: could not list pull requests: GET https://api.github.com/repos/matheuscscp-test-org/test-repo/pulls?per_page=100&state=open: 403 Resource not accessible by integration []
- I re-added the permission to read pull requests to the GitHub App.
- The issue was resolved, the reconciliation resumed.
To make sure that the cached token was not purged and was reused after I re-added the permission, I ran this command:
k logs flux-operator-79d48cc994-68cds | grep refresh
{"level":"debug","ts":"2025-03-07T20:12:05.691Z","msg":"item refreshed",
"controller":"resourcesetinputprovider","controllerGroup":"fluxcd.controlplane.io",
"controllerKind":"ResourceSetInputProvider","ResourceSetInputProvider":{
"name":"rsip-gh-app-token-cache-test","namespace":"flux-system"},"namespace":"flux-system",
"name":"rsip-gh-app-token-cache-test","reconcileID":"01dd2fc7-ef25-49ab-bbb9-bf4e64e095aa",
"key":"githubAppID=<redacted>,githubAppInstallationID=<redacted>,githubAppBaseURL=,githubAppPrivateKeyDigest=<redacted>",
"token":{"duration":"59m59.308435414s"}}
Only one log was printed, the same one I saw before the experiment by running the same command and looking at the timestamp.
Conclusion: Purging the token from the cache doesn't help when the token is invalid due to permissions. That's because the token represents an identity. The set of permissions associated with that identity are not embedded on the token. So the token is not really invalid, it just happens that the identity it represents lacks permissions. The token remains valid and that is proven by the fact that the reconciliation resumed after I re-added the permissions.
Scenario 2: GitHub App is deleted/replaced, or reinstalled in the org/repo, or the private key is rotated, or the base URL is changed
If any of these events happen, the token key will change, see how the cache key is built:
func (p *Client) buildCacheKey() string {
privateKeyDigest := sha256.Sum256(p.privateKey)
keyParts := []string{
fmt.Sprintf("%s=%s", AppIDKey, p.appID),
fmt.Sprintf("%s=%s", AppInstallationIDKey, p.installationID),
fmt.Sprintf("%s=%s", AppBaseUrlKey, p.apiURL),
fmt.Sprintf("%sDigest=%x", AppPrivateKey, privateKeyDigest),
}
return strings.Join(keyParts, ",")
}
Conclusion: The cached token is effectively a function of (appID, installationID, baseURL, privateKey)
. If these remain the same, the cached token remains valid if it's not expired.
Overall Conclusion
There are no real scenarios where purging the token from the cache really helps, we don't need to worry about that.
P.S.: We expire tokens with 80% of their lifetime, so there's no risk of getting errors due to expired token.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
add the cache in the respective clients instead where the token is actually used
I don't think this approach is good, because a GitHub App could be used for multiple different things, not just cloning a git repo. We're gonna use GitHub App tokens for other things in Flux e.g. for the notification providers. We shouldn't cache a token twice, once for the github
provider and once for the githubdispatch
provider, if the underlying app/installation/baseURL/privateKey is the same for both Provider
objects.
So, again, the token is a function of (appID, installationID, baseURL, privateKey)
, not (appID, installationID, baseURL, privateKey, use case)
.
The same idea will apply when we implement caching for OIDC tokens. They are a function of the identity they represent, not (identity, use case)
. We will share a token for a Bucket
and an OCIRepository
object if the identity they want to impersonate is the same and we already have a token for it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Makes sense, thanks for the detailed test scenarios and explanation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This comment was very important 👍 Let's leave this thread unresolved as it contains a lot of information that is important for whoever reads this PR in the future, even after merged
Signed-off-by: Matheus Pimenta <[email protected]>
9c397e9
to
9593041
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
Thanks @matheuscscp 🏅
@matheuscscp I missed that we're importing different versions of go-github. Please fix this in a followup PR. |
@stefanprodan Looked into the 2 go-github indirect references - they seem to be coming from different direct dependencies. The cosign dependency is latest 2.4.1, but its using an older version of go-github (linked here)
|
@matheuscscp Missed this during review, we should update the documentation for source-controller indicating token cache configuration options. |
This PR introduces a token cache in source-controller. At the moment, we use it for GitHub App tokens in the
GitRepository
API.This token cache will report the following metrics:
gotk_token_cache_events_total
Total number of cache retrieval events for a Gitops Toolkit resource reconciliation.
event_type
,kind
,name
,exported_namespace
. The values ofevent_type
can becache_miss
orcache_hit
.gotk_token_cache_requests_total
Total number of cache requests partioned by success or failure.
status
. The values ofstatus
can besuccess
orfailure
.gotk_token_cache_evictions_total
Total number of cache evictions.
gotk_token_cached_items
Total number of items in the cache.