
Unexpected command output nsenter: cannot open : No such file or directory #15802

Closed
djdevin opened this issue Aug 16, 2017 · 14 comments
Labels: component/networking, kind/bug, priority/P1, lifecycle/rotten


djdevin commented Aug 16, 2017

I am getting this error in the node log, along with "Error syncing pod" and "Pod sandbox changed, it will be killed and re-created." in the Events for the build pod.

W0816 17:28:05.379591    7651 docker_sandbox.go:263] NetworkPlugin cni failed on the status hook for pod "php-15-build_coretest": Unexpected command output nsenter: cannot open : No such file or directory

This happens on a build pod being launched on that node. Pods launched on the master don't display this issue.

Version

oc v3.6.0+c4dd4cf
kubernetes v1.6.1+5115d708d7
features: Basic-Auth GSSAPI Kerberos SPNEGO

Server https://xyz:8443
openshift v3.6.0+c4dd4cf
kubernetes v1.6.1+5115d708d7

Additional Information

This was an upgrade from containerized v1.5.1 -> v3.6.0 (no RCs in between) using the upgrade playbook in the release-3.6 branch of openshift-ansible, which completed successfully.

This appeared to be fixed by #15210, but that doesn't help here; I've verified that the most recent node image is running and that /opt is populated in the node container.

find (run from /opt in the node container):
.
./cni
./cni/bin
./cni/bin/host-local
./cni/bin/loopback
./cni/bin/openshift-sdn

I have destroyed the node in question and reinstalled it with the Ansible scale playbook, and that did not help.

oc adm diagnostics only returned this error

ERROR: [DNet2006 from diagnostic NetworkCheck@openshift/origin/pkg/diagnostics/network/run_pod.go:136]
       Creating network diagnostic pod "network-diag-pod-bm41g" on node "ip-172-31-10-48.ec2.internal" with command "openshift infra network-diagnostic-pod -l 1" failed: pods "network-diag-pod-bm41g" is forbidden: unable to validate against any security context constraint: [provider restricted: .spec.securityContext.hostNetwork: Invalid value: true: Host network is not allowed to be used provider restricted: .spec.securityContext.hostPID: Invalid value: true: Host PID is not allowed to be used provider restricted: .spec.securityContext.hostIPC: Invalid value: true: Host IPC is not allowed to be used provider restricted: .spec.containers[0].securityContext.privileged: Invalid value: true: Privileged containers are not allowed provider restricted: .spec.containers[0].securityContext.volumes[0]: Invalid value: "hostPath": hostPath volumes are not allowed to be used provider restricted: .spec.containers[0].securityContext.hostNetwork: Invalid value: true: Host network is not allowed to be used provider restricted: .spec.containers[0].securityContext.hostPID: Invalid value: true: Host PID is not allowed to be used provider restricted: .spec.containers[0].securityContext.hostIPC: Invalid value: true: Host IPC is not allowed to be used]

So it looks like pod connectivity is broken, but for the life of me I can't figure out why.

I could simply reinstall, as this is a test environment, but it is going to go to production eventually, so I'd like to see if there's a way to fix this if we encounter it there.

pweil- added the priority/P1, component/networking, and kind/bug labels on Aug 16, 2017
@makentenza

@djdevin, this will probably not be your case, but just in case: I got the same behaviour while upgrading from containerized 3.5 to 3.6. I had a typo in the resource limits and requests for my builds, and every build failed with the 'docker_sandbox.go:263] NetworkPlugin cni failed on the status hook for pod "php-15-build_coretest": Unexpected command output nsenter: cannot open : No such file or directory' error.

This is the typo I had:

          limits:
            cpu: 1000m
            memory: 512Mi
          requests:
            cpu: 100m
            memory: 256m

The memory units were written as 'm' instead of 'Mi'. This typo was ignored in 3.5, but not in 3.6.
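
For reference, a corrected version of the snippet above; the only change is the unit on the memory request (assuming 256Mi was the intended value):

          limits:
            cpu: 1000m
            memory: 512Mi
          requests:
            cpu: 100m
            memory: 256Mi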

As I said, this will probably not be your case, so feel free to disregard this comment if it doesn't apply.


djdevin commented Aug 17, 2017

Wow, that is exactly the case. Thanks!

I'll leave this open though because the documentation is out of date for both OCP and Origin in some places:

https://docs.openshift.com/container-platform/3.6/install_config/build_defaults_overrides.html#ansible-setting-global-build-defaults

https://docs.openshift.org/latest/install_config/build_defaults_overrides.html

I noticed it had been updated elsewhere, but we use the Ansible playbooks for deployments, so I copied the examples from the documentation.

@makentenza

Yes, I just reviewed and both repositories have this error. I will fix it and open a PR, and I'll update this issue with the details so you can close it once it's resolved.

@makentenza

The following PRs have been created:

openshift/openshift-ansible#5135
openshift/openshift-docs#5043


caruccio commented Sep 5, 2017

I'm facing exactly the same error, but only once in a while.

Set 04 21:39:40 ip-10-0-2-188.ec2.internal origin-node[11690]: W0904 21:39:40.178897   11690 docker_sandbox.go:263] NetworkPlugin cni failed on the status hook for pod "apiqueue-8-deploy_guiaon": Unexpected command output nsenter: cannot open : No such file or directory
Set 04 21:39:40 ip-10-0-2-188.ec2.internal origin-node[11690]: with error: exit status 1

resources:
  limits:
    cpu: 366m
    memory: 512Mi
  requests:
    cpu: 10m
    memory: 128Mi

This is a fresh 3.6.0 install from ansible.


caruccio commented Sep 5, 2017

A little bit more context:

[centos@m1 ~]$ oc version
oc v3.6.0+c4dd4cf
kubernetes v1.6.1+5115d708d7
features: Basic-Auth GSSAPI Kerberos SPNEGO

Server https://api-internal.infra.getupcloud.com:443
openshift v3.6.0+c4dd4cf
kubernetes v1.6.1+5115d708d7

Set 04 20:58:48 ip-10-0-2-188.ec2.internal origin-node[11690]: ValueFrom:nil} {Name:OPENSHIFT_DEPLOYMENT_NAME Value:apiqueue-8 ValueFrom:nil} {Name:OPENSHIFT_DEPLOYMENT_NAMESPACE Value:guiaon ValueFrom:nil}] Resources:{Limits:map[cpu:{i:{value:366 scale:-3} d:{Dec:<nil>} s:366m Format:DecimalSI} memory:{i:{value:268435456 scale:0} d:{Dec:<nil>} s: Format:BinarySI}] Requests:map[cpu:{i:{value:10 scale:-3} d:{Dec:<nil>} s:10m Format:DecimalSI} memory:{i:{value:26214400 scale:0} d:{Dec:<nil>} s:25Mi Format:BinarySI}]} VolumeMounts:[{Name:deployer-token-r9cmk ReadOnly:true MountPath:/var/run/secrets/kubernetes.io/serviceaccount SubPath:}] LivenessProbe:nil ReadinessProbe:nil Lifecycle:nil TerminationMessagePath:/dev/termination-log TerminationMessagePolicy:File ImagePullPolicy:IfNotPresent SecurityContext:&SecurityContext{Capabilities:&Capabilities{Add:[],Drop:[KILL MKNOD SETGID SETUID SYS_CHROOT],},Privileged:*false,SELinuxOptions:&SELinuxOptions{User:,Role:,Type:,Level:s0:c14,c9,},RunAsUser:*1000200000,RunAsNonRoot:nil,ReadOnlyRootFilesystem:nil,} Stdin:false StdinOnce:false TTY:false} is dead, but RestartPolicy says that we should restart it.

Set 04 20:59:25 ip-10-0-2-188.ec2.internal dockerd-current[13102]: --> Scaling apiqueue-7 down to zero

Set 04 21:00:29 ip-10-0-2-188.ec2.internal origin-node[11690]: E0904 21:00:29.590877   11690 pod_workers.go:182] Error syncing pod 064ef43d-8f51-11e7-b766-028441715ba0 ("apiqueue-7-kr2z0_guiaon(064ef43d-8f51-11e7-b766-028441715ba0)"), skipping: error killing pod: failed to "KillContainer" for "php" with KillContainerError: "rpc error: code = 4 desc = context deadline exceeded"

Set 04 21:00:29 ip-10-0-2-188.ec2.internal origin-node[11690]: W0904 21:00:29.890910   11690 docker_sandbox.go:263] NetworkPlugin cni failed on the status hook for pod "apiqueue-7-kr2z0_guiaon": Unexpected command output nsenter: cannot open : No such file or directory

Set 04 21:00:30 ip-10-0-2-188.ec2.internal origin-node[11690]: W0904 21:00:30.286638   11690 docker_sandbox.go:263] NetworkPlugin cni failed on the status hook for pod "apiqueue-7-kr2z0_guiaon": Unexpected command output nsenter: cannot open : No such file or directory

Set 04 21:06:12 ip-10-0-2-188.ec2.internal origin-node[11690]: W0904 21:06:12.803988   11690 docker_sandbox.go:263] NetworkPlugin cni failed on the status hook for pod "apiqueue-7-kr2z0_guiaon": Unexpected command output nsenter: cannot open : No such file or directory

Set 04 21:09:25 ip-10-0-2-188.ec2.internal dockerd-current[13102]: --> Scaling apiqueue-8 to 1 before performing acceptance check
Set 04 21:09:25 ip-10-0-2-188.ec2.internal dockerd-current[13102]: --> Waiting up to 10m0s for pods in rc apiqueue-8 to become ready

Set 04 21:19:25 ip-10-0-2-188.ec2.internal dockerd-current[13102]: error: update acceptor rejected apiqueue-8: pods for rc "apiqueue-8" took longer than 600 seconds to become ready

Set 04 21:19:29 ip-10-0-2-188.ec2.internal origin-node[11690]: W0904 21:19:29.230694   11690 docker_sandbox.go:263] NetworkPlugin cni failed on the status hook for pod "apiqueue-8-deploy_guiaon": Unexpected command output nsenter: cannot open : No such file or directory

Restarting docker resolved the issue.
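
For anyone hitting the same thing, that was just the node's Docker service; a minimal sketch, assuming systemd-managed docker as in the journal output above:

# on the affected node (docker is a systemd unit per the journal lines above)
sudo systemctl restart docker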


caruccio commented Sep 5, 2017

In fact it's happening on all node hosts. There are pods that have been stuck in Terminating for 13 days.

@makentenza

@caruccio This issue is only related to Builds, as there was a typo in the documentation around default resources for Builds. Since you are hitting this with deployments, your typo is probably configured somewhere else. Check whether you have resource limits configured in any other place, or just test the deployment in a new namespace with no resource limits applied and see what happens (a sketch of the relevant commands is below).
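
A minimal sketch of how to check that (the project name is only an example):

# look for LimitRange objects that could be injecting default requests/limits into pods
oc get limitrange --all-namespaces
oc describe limitrange -n <project>

# try the same deployment in a fresh project with no limits applied
oc new-project testlimit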


caruccio commented Sep 6, 2017

@makentenza I've tried in a new project with no limits/requests, as you suggested, but I'm still getting the same result:

Set 06 23:02:02 ip-10-0-1-188.ec2.internal origin-node[15384]: I0906 23:02:02.112803   15384 kuberuntime_manager.go:458] Container {Name:deployment Image:openshift/origin-deployer:v3.6.0 Command:[] Args:[] WorkingDir: Ports:[] EnvFrom:[] Env:[{Name:KUBERNETES_MASTER Value:https://ip-10-0-3-166.ec2.internal ValueFrom:nil} {Name:OPENSHIFT_MASTER Value:https://ip-10-0-3-166.ec2.internal ValueFrom:nil} {Name:BEARER_TOKEN_FILE Value:/var/run/secrets/kubernetes.io/serviceaccount/token ValueFrom:nil} {Name:OPENSHIFT_CA_DATA Value:-----BEGIN CERTIFICATE-----
Set 06 23:02:02 ip-10-0-1-188.ec2.internal origin-node[15384]: [certificate contents redacted]
Set 06 23:02:02 ip-10-0-1-188.ec2.internal origin-node[15384]: -----END CERTIFICATE-----
Set 06 23:02:02 ip-10-0-1-188.ec2.internal origin-node[15384]: ValueFrom:nil} {Name:OPENSHIFT_DEPLOYMENT_NAME Value:testlimit-1 ValueFrom:nil} {Name:OPENSHIFT_DEPLOYMENT_NAMESPACE Value:testlimit ValueFrom:nil}] Resources:{Limits:map[] Requests:map[]} VolumeMounts:[{Name:deployer-token-68jz7 ReadOnly:true MountPath:/var/run/secrets/kubernetes.io/serviceaccount SubPath:}] LivenessProbe:nil ReadinessProbe:nil Lifecycle:nil TerminationMessagePath:/dev/termination-log TerminationMessagePolicy:File ImagePullPolicy:IfNotPresent SecurityContext:&SecurityContext{Capabilities:&Capabilities{Add:[],Drop:[KILL MKNOD SETGID SETUID SYS_CHROOT],},Privileged:*false,SELinuxOptions:&SELinuxOptions{User:,Role:,Type:,Level:s0:c45,c35,},RunAsUser:*1002050000,RunAsNonRoot:nil,ReadOnlyRootFilesystem:nil,} Stdin:false StdinOnce:false TTY:false} is dead, but RestartPolicy says that we should restart it.
...
Set 06 23:02:22 ip-10-0-1-188.ec2.internal origin-node[15384]: W0906 23:02:22.449586   15384 docker_sandbox.go:263] NetworkPlugin cni failed on the status hook for pod "testlimit-1-build_testlimit": Unexpected command output nsenter: cannot open : No such file or directory

Note there are no limits in the pod definition: Resources:{Limits:map[] Requests:map[]}.
Do you believe it's another possible bug?


caruccio commented Sep 7, 2017

I believe I've figured it out. The nsenter error message appears after the pod's container is destroyed, so nsenter can't find the process's network namespace file (/proc//ns/net, with an empty PID) when the node reports pod status to the controller (I guess).
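
A quick way to see where the message comes from (a sketch; this is assumed to mirror what the openshift-sdn status hook ends up doing once the sandbox is already gone, not the exact command it runs): passing an empty network-namespace path to nsenter produces exactly this error.

# reproduce the message by handing nsenter an empty netns path
$ nsenter --net= ip addr
nsenter: cannot open : No such file or directory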

@smarterclayton

I think it's likely that we shouldn't be reporting this error. @sjenning, this is something we should open upstream.

@openshift-bot

Issues go stale after 90d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle stale

openshift-ci-robot added the lifecycle/stale label on Feb 19, 2018
@openshift-bot

Stale issues rot after 30d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.
Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle rotten
/remove-lifecycle stale

openshift-ci-robot added the lifecycle/rotten label and removed the lifecycle/stale label on Mar 21, 2018
@openshift-bot

Rotten issues close after 30d of inactivity.

Reopen the issue by commenting /reopen.
Mark the issue as fresh by commenting /remove-lifecycle rotten.
Exclude this issue from closing again by commenting /lifecycle frozen.

/close
