Skip to content

Introduce token cache and use it for GitHub App tokens #1745

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 1 commit into from
Mar 12, 2025
Merged

Conversation

matheuscscp
Copy link
Member

This PR introduces a token cache in source-controller. At the moment, we use it for GitHub App tokens in the GitRepository API.

This token cache will report the following metrics:

  • Counter gotk_token_cache_events_total
    • Help: Total number of cache retrieval events for a Gitops Toolkit resource reconciliation.
    • Labels: event_type, kind, name, exported_namespace. The values of event_type can be cache_miss or cache_hit.
  • Counter gotk_token_cache_requests_total
    • Help: Total number of cache requests partioned by success or failure.
    • Labels: status. The values of status can be success or failure.
  • Counter gotk_token_cache_evictions_total
    • Help: Total number of cache evictions.
    • No labels.
  • Gauge gotk_token_cached_items
    • Help: Total number of items in the cache.
    • No labels.

@matheuscscp matheuscscp requested a review from dipti-pai March 7, 2025 18:19
@@ -677,6 +683,7 @@ func (r *GitRepositoryReconciler) getAuthOpts(ctx context.Context, obj *sourcev1
Name: sourcev1.GitProviderGitHub,
GitHubOpts: []github.OptFunc{
github.WithAppData(authData),
github.WithCache(r.tokenCache, sourcev1.GitRepositoryKind, obj.GetName(), obj.GetNamespace()),
Copy link
Member

@dipti-pai dipti-pai Mar 7, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This pkg PR introduces cache in the auth/github pkg and caches the token once it is fetched. In scenarios where the token is invalid (for example: due to incorrectly permissions configured by the user), the git operations would fail with an error (similar to Unauthorized). The token needs to be invalidated from the cache in this case.

When we had attempted to add the cache as part of enabling Azure OIDC for gitrepository, we had discussed in the PR comment and the dev meetings to add the cache in the respective clients instead where the token is actually used. For example, for git operations, the cache would be added to git client. During clone operations, cached token would be fetched from the cache first. In case of a cache HIT, it would be used to do the clone. If the clone operation fails and the token is no longer valid it would be invalidated from the cache. In case of a cache miss or invalid token, a new token would be fetched using auth/github GetToken API. This token would be cached after successful clone operation.

Copy link
Member Author

@matheuscscp matheuscscp Mar 7, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Scenario 1: Permission errors

I just did a simple test here with our implementation for flux-operator, we have this code running in a cluster for a few days.

This was my experiment:

  1. I revoked the GitHub App permission to read pull requests.
  2. The object reconciliation started showing this error: failed to call provider failed to list requests: could not list pull requests: GET https://api.github.com/repos/matheuscscp-test-org/test-repo/pulls?per_page=100&state=open: 403 Resource not accessible by integration []
  3. I re-added the permission to read pull requests to the GitHub App.
  4. The issue was resolved, the reconciliation resumed.

To make sure that the cached token was not purged and was reused after I re-added the permission, I ran this command:

k logs flux-operator-79d48cc994-68cds | grep refresh
{"level":"debug","ts":"2025-03-07T20:12:05.691Z","msg":"item refreshed",
"controller":"resourcesetinputprovider","controllerGroup":"fluxcd.controlplane.io",
"controllerKind":"ResourceSetInputProvider","ResourceSetInputProvider":{
"name":"rsip-gh-app-token-cache-test","namespace":"flux-system"},"namespace":"flux-system",
"name":"rsip-gh-app-token-cache-test","reconcileID":"01dd2fc7-ef25-49ab-bbb9-bf4e64e095aa",
"key":"githubAppID=<redacted>,githubAppInstallationID=<redacted>,githubAppBaseURL=,githubAppPrivateKeyDigest=<redacted>",
"token":{"duration":"59m59.308435414s"}}

Only one log was printed, the same one I saw before the experiment by running the same command and looking at the timestamp.

Conclusion: Purging the token from the cache doesn't help when the token is invalid due to permissions. That's because the token represents an identity. The set of permissions associated with that identity are not embedded on the token. So the token is not really invalid, it just happens that the identity it represents lacks permissions. The token remains valid and that is proven by the fact that the reconciliation resumed after I re-added the permissions.

Scenario 2: GitHub App is deleted/replaced, or reinstalled in the org/repo, or the private key is rotated, or the base URL is changed

If any of these events happen, the token key will change, see how the cache key is built:

https://github.com/fluxcd/pkg/blob/8b1f852228b117123efcc7b5708ce112801416b6/auth/github/client.go#L232-L241

func (p *Client) buildCacheKey() string {
	privateKeyDigest := sha256.Sum256(p.privateKey)
	keyParts := []string{
		fmt.Sprintf("%s=%s", AppIDKey, p.appID),
		fmt.Sprintf("%s=%s", AppInstallationIDKey, p.installationID),
		fmt.Sprintf("%s=%s", AppBaseUrlKey, p.apiURL),
		fmt.Sprintf("%sDigest=%x", AppPrivateKey, privateKeyDigest),
	}
	return strings.Join(keyParts, ",")
}

Conclusion: The cached token is effectively a function of (appID, installationID, baseURL, privateKey). If these remain the same, the cached token remains valid if it's not expired.

Overall Conclusion

There are no real scenarios where purging the token from the cache really helps, we don't need to worry about that.

P.S.: We expire tokens with 80% of their lifetime, so there's no risk of getting errors due to expired token.

Copy link
Member Author

@matheuscscp matheuscscp Mar 7, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

add the cache in the respective clients instead where the token is actually used

I don't think this approach is good, because a GitHub App could be used for multiple different things, not just cloning a git repo. We're gonna use GitHub App tokens for other things in Flux e.g. for the notification providers. We shouldn't cache a token twice, once for the github provider and once for the githubdispatch provider, if the underlying app/installation/baseURL/privateKey is the same for both Provider objects.

So, again, the token is a function of (appID, installationID, baseURL, privateKey), not (appID, installationID, baseURL, privateKey, use case).

The same idea will apply when we implement caching for OIDC tokens. They are a function of the identity they represent, not (identity, use case). We will share a token for a Bucket and an OCIRepository object if the identity they want to impersonate is the same and we already have a token for it.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Makes sense, thanks for the detailed test scenarios and explanation.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This comment was very important 👍 Let's leave this thread unresolved as it contains a lot of information that is important for whoever reads this PR in the future, even after merged

Copy link
Member

@stefanprodan stefanprodan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Thanks @matheuscscp 🏅

@stefanprodan stefanprodan added enhancement New feature or request area/git Git related issues and pull requests labels Mar 12, 2025
@matheuscscp matheuscscp merged commit 840bea6 into main Mar 12, 2025
10 checks passed
@matheuscscp matheuscscp deleted the token-cache branch March 12, 2025 21:45
@stefanprodan
Copy link
Member

@matheuscscp I missed that we're importing different versions of go-github. Please fix this in a followup PR.

@dipti-pai
Copy link
Member

dipti-pai commented Mar 13, 2025

@matheuscscp I missed that we're importing different versions of go-github. Please fix this in a followup PR.

@stefanprodan Looked into the 2 go-github indirect references - they seem to be coming from different direct dependencies. The cosign dependency is latest 2.4.1, but its using an older version of go-github (linked here)

This module is necessary because github.com/google/go-github/v55/github is imported in:

- github.com/fluxcd/source-controller/internal/oci/cosign
-- github.com/sigstore/cosign/v2/cmd/cosign/cli/options
--- github.com/sigstore/cosign/v2/pkg/signature
---- github.com/sigstore/cosign/v2/pkg/cosign/git
----- github.com/sigstore/cosign/v2/pkg/cosign/git/github

This module is necessary because github.com/google/go-github/v69/github is imported in:

- github.com/fluxcd/source-controller/internal/controller
-- github.com/fluxcd/pkg/git/github
--- github.com/bradleyfalzon/ghinstallation/v2

@dipti-pai
Copy link
Member

@matheuscscp Missed this during review, we should update the documentation for source-controller indicating token cache configuration options.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/git Git related issues and pull requests enhancement New feature or request
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants