[OKD-SCOS v4.16] UPI install with agent installer or assisted installer fails for 4.16.0-0.okd-scos-2024-08-01-132038 #2015
-
I've just checked: the agent-based installer is not part of docs.okd.io; you can find it on docs.openshift.com. Maybe this method is not supported by OKD!
-
Follow-up
Also fails with the Assisted Installer:
The installation fails in the same way with the Assisted Installer. Not a big surprise, as the Assisted Installer uses the agent installer under the hood.
Workaround for the Assisted Installer:
The installation works when the embedded FCOS bootstrap OS is replaced by RHCOS:
Workaround for the Agent-Based Installer (ABI):
Overriding the bootstrap OS image with an RHCOS image also makes the installation succeed when installing OKD via ABI, by using the following when building the install image:
I did not pick a random bootstrap OS image: this is the one specified for v4.16 for an OCP installation via the ABI, as shown here: https://github.com/openshift/assisted-service/blob/d3324b06a7c7772f4619c3ab13dd8c0706e55fd9/deploy/podman/configmap.yml#L25
Q:
-
Hi, I have encountered the exact same issue. Overriding the bootstrap OS image with an RHCOS image as proposed does not work for me, because my lab hardware uses an LSI controller (a Dell PERC H310 with IT firmware) that is not recognized by the RHCOS image (the mpt3sas driver in RHCOS has support for it disabled). The other alternative, using openshift-install-linux-4.15.0-0.okd-scos-2024-01-18-223523, does not work for me either: the rendezvous host fails to start the bootstrap with the error 'pull secret for new cluster is invalid: pull secret must contain auth for "registry.ci.openshift.org"'. @titou10titou10, shouldn't you open an issue about this problem? Best Regards, Alain
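The pull secret error quoted above can be checked for before building the install image. Here is a minimal sketch, assuming the pull secret is the usual JSON document with a top-level "auths" object keyed by registry host. The function name `missing_registries` and the sample secret are hypothetical and not part of any OKD tooling.

```python
import json


def missing_registries(pull_secret_text, required):
    """Return the registries from `required` that have no entry under "auths"."""
    secret = json.loads(pull_secret_text)
    auths = secret.get("auths", {})
    return [registry for registry in required if registry not in auths]


# Hypothetical sample secret; the credential value is a placeholder.
sample = json.dumps({"auths": {"quay.io": {"auth": "cGxhY2Vob2xkZXI="}}})

# Lists every required registry that the secret does not cover.
print(missing_registries(sample, ["quay.io", "registry.ci.openshift.org"]))
```

This only checks that an entry exists for each registry host. It does not verify that the credentials are valid.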
-
I reproduced this problem with the 4.16.0-0.okd-scos-2024-09-27-110344 image on a Dell PowerEdge R740. I also verified the proposed RHCOS workaround, which worked fine for me.
-
We now have OKD releases based on SCOS bootimages, so this should not be a problem anymore. @titou10titou10 could you give 4.19.0-okd-scos.ec.5 a try, for SNO as well as assisted/agent? Thanks!
-
Test install of a SNO with ABI of v4.19.0-okd-scos.ec.5 without overriding the bootstrap image with an RH image:
Last remark: I have to re-verify this, but during the first boot the boot process stops at a screen asking to configure the network, with two choices: "quit" or "configure". And I like the new 4.19 console look and its new features (favorites, Helm, ...), but that's another story lol
-
Thanks for the quick feedback:
This is because the "machine-os-content" image was still pulling in the RHCOS ISO; this has been fixed recently and should be addressed on my next promotion of an ec image. Meanwhile, you can use the latest image in the 4.19.0-0.okd-scos stream, but you need a pull secret for that.
It seems other components are hitting this issue as well. See: https://issues.redhat.com//browse/OCPBUGS-54175. There is an ongoing fix for this in the MCO. For the issue with the assisted-service-db, could you please collect the journal logs for that unit so we can take a look?
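Collecting the journal for that unit can be sketched as follows. This assumes the systemd unit is named `assisted-service-db.service` (inferred from the comment above, not verified) and that `journalctl` is available on the node. The helper names are hypothetical.

```python
import subprocess


def journal_command(unit, current_boot_only=True):
    """Build a journalctl command line that prints the log of one systemd unit."""
    cmd = ["journalctl", "--no-pager", "-u", unit]
    if current_boot_only:
        cmd.append("-b")  # restrict output to the current boot
    return cmd


def collect_journal(unit, out_path):
    """Run journalctl for `unit` and write its output to `out_path`."""
    with open(out_path, "w") as out:
        subprocess.run(journal_command(unit), stdout=out, check=True)


if __name__ == "__main__":
    # Unit name assumed from the discussion; adjust to what `systemctl` reports.
    print(" ".join(journal_command("assisted-service-db.service")))
```

Calling `collect_journal("assisted-service-db.service", "assisted-service-db.log")` on the affected node would then produce a file that can be attached to the discussion.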
-
OK. That makes sense. A similar fix is being discussed to add
-
@Prashanth684 If I understand correctly what happened, the just-merged fix will be in v4.20.0+, right? Will it be backported to v4.19 or v4.18?
-
The latest nightly release, 4.19.0-0.okd-scos-2025-03-28-055146, has the fix. I am going to promote it to an ec. I would appreciate it if you could quickly test it, though.
-
Test install of a SNO with ABI of v4.19.0-okd-scos.ec.6 without overriding the bootstrap image with an RH image:
Alert: 100% of the console/console targets in openshift-console namespace have been unreachable for more than 15 minutes. This may be a symptom of network connectivity issues, down nodes, or failures within these components. Assess the health of the infrastructure and nodes running these targets and then contact support.
Message: E0330 13:32:38.406673 1 middleware.go:51] TOKEN_REVIEW: 'GET /metrics' unauthorized, invalid user token, [invalid bearer token, token lookup failed]
The experience is quite smooth now. Thanks
-
I tested a full 3+3 node cluster with v4.19.0-okd-scos.ec.6. It works well.
-
Context
Trying to install a SNO (Single Node) cluster:
It is important to note that the install works perfectly well with the exact same agent and install config files for
I have also tried a multi-node cluster; it fails the same way.
Summary
It fails with the following error from the "release-image-pivot" service:
Details
install-config.yaml:
agent-config.yaml
The install fails after a few minutes: the node boots, but the process fails, looping forever
"kubelet" service:
"release-image-pivot" service:
So the "release-image-pivot" service fails to start because of this problem?:
Other (pertinent?) info:
approve-csr.service
podman
Message from the installer: