[BUG] Failed to renew TLS certificates #3294
Comments
As a general question, is there some particular Linux platform where crc might work better? It seems really broken on Fedora 36. Maybe it's better to use CentOS or RHEL? |
@rwmjones It should work with F-36; we (most of the crc devs) are using F-36 day to day. It looks like the disk image (bundle) we ship with the release has an expired cert and regeneration of it is now failing. We didn't see this issue during our testing, but we will check again and get back to you. |
On F36 I did 3 starts in a row.
I also ran e2e test once, and it passed. |
Maybe a hardware speed thing? I've done about half a dozen restarts and they all failed in the same way. |
Can you share your hardware configuration? |
Sure. It's an Intel NUC RNUC11PAHi50000 which is a fairly standard 4 core / 8 thread 11th Gen Intel Core i5 mobile chipset. This has Fedora 36 installed on it, and all |
For what it's worth, since I ran into this issue with release 2.6.0, I tried the previous release 2.5.1, and ran into the same issue there:
|
@rwmjones Do you have any other hardware where you can run crc (even remotely), to see whether this is really something hardware related? During my testing (on GCP, on a nested-virt-enabled VM) it took more time than usual, but the cluster did come up. |
It is our biggest issue as this is not possible OOTB. |
The NUC and the other system for which an issue was also filed are SFF (small form factor) machines, so this might be related to CPU throttling due to thermals; however, Praveen has now also seen this on GCP. It looks like a timing/timeout issue, but we didn't see this before. We have automated tests in place to confirm this works, but none in resource-constrained environments? |
I don't really have other hardware to run this on. Re: "Can we run the certificate renewal manually with custom timeouts to work around this issue?" - how would I do that? |
There is no mechanism AFAIK to trigger this or define other timeouts. |
I'm hitting the same error with |
I would suggest waiting till next week so we have a new version with an updated bundle certificate. |
I'm trying |
BTW it always says:
However I'm not sure if this causes a problem. |
I still get 401 Unauthorized errors logging in as kubeadmin, either through the web interface or with
Some parts (eg. the web interface) are running if I log in as |
@rwmjones which means |
Did you want me to just run the crc start command, or to delete and rebuild the whole VM? Anyway the output with just |
Here is a copy of my ^ This is on a 64GB laptop with It seems to work correctly, but reports a failure and complains about not reaching its intended startup state within 10 mins... |
@ryanj Sometimes some of the operators aren't able to reconcile within 10 mins, but in your case the cluster is healthy and in a running state. The issue I see with @rwmjones is the hardware, which is a NUC and might take longer than usual, but I still don't understand why the kubeconfig file is not updated in the |
@rwmjones Can you please execute the following and let us know the output at debug level?
|
@rwmjones Thanks. As per the logs, it looks like the apiserver is not even able to become available during the allocated time, and that is why the kubeconfig file is not present in the respective directory. This really looks hardware related; even if we add more wait time, the overall performance of this cluster might not be suitable for workloads :(
|
Can we make the timeouts longer or configurable? The machine has 16G of RAM and is not swapping. |
@rwmjones I just created a custom Linux binary with an increased timeout; can you try it? Please delete the cluster before starting with this binary.
This binary uses the following patch (remove fast failure and increase the overall retry time)
|
I tried it twice and it failed both times. Gist from the second attempt: https://gist.github.com/rwmjones/084d4abd35e76a4c8b7eab7b7c42b53d I don't think the change to the timeout had any effect since it appeared to only wait 4 mins. Interestingly, I tried going inside the VM while it was starting. The VM is only using half the available RAM (8GB); I think it could easily be larger. It also has only half the available cores (4). However it's not swapping, although it is doing a very large amount of I/O, and I think you could give the VM something like total host system RAM - 4 GB, and total host pCPUs - 2, or something like that. |
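For illustration, here is a minimal sketch of that "host minus a reserve" heuristic, assuming 4 GiB of RAM and 2 pCPUs are held back for the host and crc's default 8 GiB / 4 vCPU guest is the floor; the function name is hypothetical and this is not how crc actually sizes the VM:

```go
// Hypothetical sizing heuristic sketch: give the guest everything except
// a fixed reserve for the host, never going below crc's defaults.
package main

import "fmt"

// suggestedGuestSize keeps back 4 GiB of RAM and 2 pCPUs for the host,
// but never returns less than the default 8 GiB / 4 vCPU guest.
func suggestedGuestSize(hostMemGiB, hostCPUs int) (memGiB, cpus int) {
	memGiB = hostMemGiB - 4
	if memGiB < 8 {
		memGiB = 8
	}
	cpus = hostCPUs - 2
	if cpus < 4 {
		cpus = 4
	}
	return memGiB, cpus
}

func main() {
	// The NUC discussed in this thread: 16 GiB RAM, 4 cores / 8 threads.
	mem, cpus := suggestedGuestSize(16, 8)
	fmt.Printf("guest: %d GiB RAM, %d vCPUs\n", mem, cpus) // guest: 12 GiB RAM, 6 vCPUs
}
```

On the 16 GiB / 8-thread NUC above this works out to a 12 GiB / 6 vCPU guest, which matches the sizing tried later in this thread.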
loadavg inside the VM, several minutes after crc start gave up: |
@rwmjones it does work from the node side but then fails when getting configmaps, which is again using
Yes, it is also going to use the memory once all the operators are up and running. Also, you can use the following to provide more RAM and CPU to the system, but make sure you delete the existing instance first
This time the patch is:
diff --git a/pkg/crc/cluster/cluster.go b/pkg/crc/cluster/cluster.go
index 0f5009c1..f8bc9674 100644
--- a/pkg/crc/cluster/cluster.go
+++ b/pkg/crc/cluster/cluster.go
@@ -413,7 +413,7 @@ func WaitForRequestHeaderClientCaFile(ctx context.Context, sshRunner *ssh.Runner
func WaitForAPIServer(ctx context.Context, ocConfig oc.Config) error {
logging.Info("Waiting for kube-apiserver availability... [takes around 2min]")
waitForAPIServer := func() error {
- stdout, stderr, err := ocConfig.WithFailFast().RunOcCommand("get", "nodes")
+ stdout, stderr, err := ocConfig.RunOcCommand("get", "nodes")
if err != nil {
logging.Debug(stderr)
return &errors.RetriableError{Err: err}
@@ -421,7 +421,7 @@ func WaitForAPIServer(ctx context.Context, ocConfig oc.Config) error {
logging.Debug(stdout)
return nil
}
- return errors.Retry(ctx, 4*time.Minute, waitForAPIServer, time.Second)
+ return errors.Retry(ctx, 10*time.Minute, waitForAPIServer, time.Second)
}
func DeleteOpenshiftAPIServerPods(ctx context.Context, ocConfig oc.Config) error {
@@ -431,7 +431,7 @@ func DeleteOpenshiftAPIServerPods(ctx context.Context, ocConfig oc.Config) error
deleteOpenshiftAPIServerPods := func() error {
cmdArgs := []string{"delete", "pod", "--all", "--force", "-n", "openshift-apiserver"}
- _, stderr, err := ocConfig.WithFailFast().RunOcCommand(cmdArgs...)
+ _, stderr, err := ocConfig.RunOcCommand(cmdArgs...)
if err != nil {
return &errors.RetriableError{Err: fmt.Errorf("Failed to delete pod from openshift-apiserver namespace %v: %s", err, stderr)}
}
diff --git a/pkg/crc/cluster/csr.go b/pkg/crc/cluster/csr.go
index 9ed5e78a..181ef781 100644
--- a/pkg/crc/cluster/csr.go
+++ b/pkg/crc/cluster/csr.go
@@ -16,7 +16,7 @@ import (
func WaitForOpenshiftResource(ctx context.Context, ocConfig oc.Config, resource string) error {
logging.Debugf("Waiting for availability of resource type '%s'", resource)
waitForAPIServer := func() error {
- stdout, stderr, err := ocConfig.WithFailFast().RunOcCommand("get", resource)
+ stdout, stderr, err := ocConfig.RunOcCommand("get", resource)
if err != nil {
logging.Debug(stderr)
return &crcerrors.RetriableError{Err: err}
@@ -47,7 +47,7 @@ func getCSRList(ctx context.Context, ocConfig oc.Config, expectedSignerName stri
if err := WaitForOpenshiftResource(ctx, ocConfig, "csr"); err != nil {
return nil, err
}
- output, stderr, err := ocConfig.WithFailFast().RunOcCommand("get", "csr", "-ojson")
+ output, stderr, err := ocConfig.RunOcCommand("get", "csr", "-ojson")
if err != nil {
return nil, fmt.Errorf("Failed to get all certificate signing requests: %v %s", err, stderr)
} |
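For context, the timeout this patch raises feeds a standard poll-until-deadline retry loop. Below is a minimal, self-contained sketch of that pattern, assuming a RetriableError marker type and a fixed one-second polling interval; it is only an illustration of why a longer window helps slow hardware, not the actual pkg/crc/errors implementation.

```go
// Minimal sketch of a poll-until-deadline retry loop, assuming a
// RetriableError marker type; illustrative only, not the real
// pkg/crc/errors code.
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// RetriableError marks a failure that is worth retrying.
type RetriableError struct{ Err error }

func (e *RetriableError) Error() string { return e.Err.Error() }

// Retry calls fn every interval until it succeeds, returns a
// non-retriable error, or the overall timeout elapses.
func Retry(ctx context.Context, timeout time.Duration, fn func() error, interval time.Duration) error {
	ctx, cancel := context.WithTimeout(ctx, timeout)
	defer cancel()
	for {
		err := fn()
		if err == nil {
			return nil
		}
		var retriable *RetriableError
		if !errors.As(err, &retriable) {
			return err // permanent failure: give up immediately
		}
		select {
		case <-ctx.Done():
			return fmt.Errorf("timed out after %s: %w", timeout, err)
		case <-time.After(interval):
		}
	}
}

func main() {
	attempts := 0
	checkAPIServer := func() error {
		attempts++
		if attempts < 3 {
			return &RetriableError{Err: fmt.Errorf("kube-apiserver not ready yet")}
		}
		return nil
	}
	if err := Retry(context.Background(), 10*time.Minute, checkAPIServer, time.Second); err != nil {
		fmt.Println("gave up:", err)
		return
	}
	fmt.Printf("apiserver ready after %d attempts\n", attempts)
}
```

With a 4-minute window, a machine whose apiserver takes longer than that to come up never gets a successful attempt; raising the window to 10 minutes (and dropping the fail-fast behaviour on each oc call, per the patch description) simply allows more polls before crc gives up.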
I can confirm that this time the VM was created with 12GB RAM and 6 cores. |
@rwmjones It went one step further but then failed again due to slow IO/processing. Now I am out of ideas for how it can work :(
I have ordered more RAM. |
I have upgraded the machine to 64 GB of RAM, the maximum possible for this NUC hardware. Surprisingly, the default VM created is still 8 GB / 4 cores; I would have expected it to depend on the available host memory and cores in some way. My initial attempt to start crc failed as before, probably because of this. So I used:
It basically fails in the same way as far as I can tell: https://gist.github.com/rwmjones/0c48232408e7396b43a4cdbc64ded877 |
@praveenkumar this might be related to the auth not becoming available in time? |
@gbraad No, as per logs it is failing long before that. |
Hi. For the last two weeks I've been trying to use crc while suffering the same problems as @rwmjones. Renewal of certificates included. During the last tests, I found with horror that my whole home folder was being shared with the crc VM. Because of this, I have created a new user dedicated to run crc, and incidentally this seems to have solved the problems. Also, the new user's home is located in a smaller yet faster drive. Not sure if that's related. I'm using quite modest hardware: i5-4670 CPU @ 3.40GHz (4 core / 4 thread), 16 GB RAM, Fedora 35. |
@robertxgray With the latest version of CRC you shouldn't see the certificate renewal issue, and file share support was added in crc 2.7.1. Are you seeing this issue with 2.7.1 or with an older version? |
@praveenkumar Sorry for the misunderstanding. I mean I've had the same errors as rwmjones since the beginning of this thread, but certificate renewal issues were gone after updating to 2.7.1 as expected. |
@robertxgray But to make 2.7.1 work you had to create a separate user because the home folder is being shared with the CRC VM? I want to figure out whether this needs a different bug and we missed some corner case. |
@praveenkumar I created another user because I didn't want CRC to mess with all the junk stored in my main user's home folder. CRC being able to start with the new user was a nice and unexpected side effect. Before that, I was having the same errors shown in rwmjones' latest logs. I have performed some additional tests moving the new user's home folder to the slow hard drive and CRC still works. Sometimes I get: |
@robertxgray Thank you for confirming. |
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions. |
Thanks for the issue; if it still exists, please create a new one with the latest version of crc. /close |
@praveenkumar: Closing this issue. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
General information
Did you run crc setup before starting it? YES
CRC version
CRC status
CRC config
- consent-telemetry : no
Host Operating System
Steps to reproduce
Expected
CRC should work, I guess?
Actual
This error happens every time I try to use crc.
Logs
https://gist.github.com/rwmjones/3a58df5c478e11e003455243cce0d8f9