Fix terminating containers #102
Conversation
Skipping CI for Draft Pull Request.
/test all
It looks like the changes aren't playing nicely with odo, e.g. all of the stacks are failing with the same error.
stacks/dotnet50/devfile.yaml
Outdated
@@ -25,6 +25,7 @@ components:
   - name: dotnet
     container:
       image: registry.access.redhat.com/ubi8/dotnet-50:5.0
+      command: ["tail", "-f", "/dev/null"]
Generally, the safer approach is to override `args` rather than `command`, in order to allow any entrypoint/setup script to run. This will generally work, as most containers have an entrypoint along the lines of:
#!/bin/bash
# do necessary setup, start daemons, etc.
exec "$@"
For the dotnet image here, for example, the default command is:
#!/bin/bash
set -e
source /opt/app-root/etc/generate_container_user
# .NET uses libcurl for HTTP handling. libcurl doesn't use 'HTTP_PROXY', but uses 'http_proxy'.
# libcurl is not using the upper-case to avoid exploiting httpoxy in CGI-like environments (https://httpoxy.org/).
# CGI-like environments must be fixed for this exploit (https://access.redhat.com/security/vulnerabilities/httpoxy).
# In an OpenShift context, it is common to use the upper-case 'HTTP_PROXY'.
if [ -z "$http_proxy" ] && [ ! -z "$HTTP_PROXY" ]; then
export http_proxy="${HTTP_PROXY}"
fi
# Trust certificates from DOTNET_SSL_DIRS.
if [ -n "$DOTNET_SSL_DIRS" ]; then
# The main process (PID 1) sets up a certificate folder. The other processes use it.
if [ $$ -eq 1 ]; then
source /opt/app-root/etc/trust_ssl_dirs
else
export SSL_CERT_DIR="$DOTNET_SSL_CERT_DIR"
fi
fi
cmd="$1"; shift
exec $cmd "$@"
The only time setting `args` instead of `command` will fail is when the default entrypoint for an image ignores args, but that case should be rare.
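In Kubernetes terms, the devfile `command` maps to the container entrypoint and `args` to its arguments, so the difference can be reproduced locally with docker (a rough sketch, not part of this PR; it assumes docker is available and uses the dotnet image discussed above):

# args-style override: the image entrypoint still runs its setup and then execs "tail -f /dev/null"
docker run --rm registry.access.redhat.com/ubi8/dotnet-50:5.0 tail -f /dev/null

# command-style override: replacing the entrypoint skips the setup script entirely
docker run --rm --entrypoint tail registry.access.redhat.com/ubi8/dotnet-50:5.0 -f /dev/null

Both containers stay up; the difference is only whether the setup script in container-entrypoint gets a chance to run.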
This pattern is also widely used in the old Che devfile v1 images, e.g. https://github.com/eclipse-che/che-devfile-registry/blob/main/dockerfiles/entrypoint.sh
@amisevsk thanks I will test that.
After using `args` instead of `command`, these stacks are still failing:
nodejs-nextjs nodejs java-quarkus php-laravel go nodejs-nuxtjs nodejs-vue python java-websphereliberty dotnet50 nodejs-react dotnet60 nodejs-svelte java-websphereliberty-gradle java-openliberty-gradle java-openliberty nodejs-angular dotnetcore31
Here's a summary of the entrypoint for each image:
STACK NAME | IMAGE | ENTRYPOINT |
---|---|---|
nodejs-nextjs | node:lts-slim | ["docker-entrypoint.sh"] |
nodejs | registry.access.redhat.com/ubi8/nodejs-14:latest | ["container-entrypoint"] |
java-quarkus | registry.access.redhat.com/ubi8/openjdk-11 | null |
php-laravel | composer:2.1.11 | ["/docker-entrypoint.sh"] |
go | golang:latest | null |
nodejs-nuxtjs | node:lts | ["docker-entrypoint.sh"] |
nodejs-vue | node:lts-slim | ["docker-entrypoint.sh"] |
python | quay.io/eclipse/che-python-3.7:nightly | ["/entrypoint.sh"] |
java-websphereliberty | icr.io/appcafe/websphere-liberty-devfile-stack:22.0.0.1 | null |
dotnet50 | registry.access.redhat.com/ubi8/dotnet-50:5.0 | ["container-entrypoint"] |
nodejs-react | node:lts-slim | ["docker-entrypoint.sh"] |
dotnet60 | registry.access.redhat.com/ubi8/dotnet-60:6.0 | ["container-entrypoint"] |
nodejs-svelte | node:lts-slim | ["docker-entrypoint.sh"] |
java-websphereliberty-gradle | icr.io/appcafe/websphere-liberty-devfile-stack:22.0.0.1-gradle | null |
java-openliberty-gradle | icr.io/appcafe/open-liberty-devfile-stack:22.0.0.1-gradle | null |
java-openliberty | icr.io/appcafe/open-liberty-devfile-stack:22.0.0.1 | null |
nodejs-angular | node:lts-slim | ["docker-entrypoint.sh"] |
dotnetcore31 | registry.access.redhat.com/ubi8/dotnet-31:3.1 | ["container-entrypoint"] |
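(For reference, a table like this can be reproduced locally with something along these lines -- a sketch that assumes docker is installed and only lists a few of the images above:)

for image in node:lts-slim registry.access.redhat.com/ubi8/dotnet-50:5.0 golang:latest; do
  docker pull --quiet "$image" >/dev/null
  printf '%s -> ' "$image"
  docker inspect --format '{{json .Config.Entrypoint}}' "$image"   # prints e.g. ["docker-entrypoint.sh"] or null
done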
Checking a few of these, it seems they should be fine if `args` is set to ["tail", "-f", "/dev/null"] and `command` is left unchanged (e.g. the node:lts and node:lts-slim images use an entrypoint that should just exec the args) -- running `docker run -it --rm $image tail -f /dev/null` does result in a non-terminating container.
I don't know if odo is doing anything fancy for these images, but if it was always just overriding the command/args then it should work, as far as I can tell.
/retest
CI now passing with odo v2.5.1 👍
@l0rd Can this be moved out of Draft / WIP now?
No, please don't merge this yet. The majority of users still use 2.5.0 and, more importantly, the OpenShift connector (VS Code plugin) is still based on 2.5.0. If you merge this now, it will break them.
@kadel I will wait for your approval to merge this
Some Devfiles in the default registry define terminating container components. As such, they cannot be used as-is without overriding container entrypoints with Supervisord. In other words, those cases cannot be tested with the `--no-supervisord` flag. Note: we would need to revert this once [1] is merged into devfile/registry. [1] devfile/registry#102
As discussed in [1], this will alleviate the maintenance burden, while waiting for [2] to get merged some day on the Devfile side. [1] redhat-developer#5768 (comment) [2] devfile/registry#102
This temporary hack overrides the container entrypoint with "tail -f /dev/null" if the component defines no command or args (in which case we should have used whatever is defined in the Devfile, per the specification). As discussed in [1], this allows us to get rid of Supervisord right away without us having to wait until [2] is merged on the Devfile side. [1] redhat-developer#5768 (comment) [2] devfile/registry#102
To do this, the idea is to start the container component:
1- using the command/args defined in the Devfile
2- using whatever was defined in the container image if there is no command/args defined in the Devfile
Then, once the container is started, we would execute the Devfile commands directly in the container component, just like a simple 'kubectl exec' command would do. Since this is a long-running command (and potentially never-ending), we would need to run it in the background, i.e. in a side goroutine.
Point 2) above requires implementing a temporary hack (as discussed in [1]), without us having to wait for [2] to be merged on the Devfile side. This temporary hack overrides the container entrypoint with "tail -f /dev/null" if the component defines no command or args (in which case we should have used whatever is defined in the image, per the specification).
[1] redhat-developer#5768 (comment) [2] devfile/registry#102
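(A rough kubectl-level analogue of the flow described in these commit messages -- a sketch only, since odo implements this in Go rather than via kubectl; $image and $devfile_run_command are placeholders:)

# 1) start the component; the "tail -f /dev/null" hack applies only when the devfile defines no command/args
kubectl run dev-component --image="$image" --command -- tail -f /dev/null
kubectl wait pod dev-component --for condition=Ready --timeout=60s
# 2) run the (potentially never-ending) devfile command inside it, in the background
kubectl exec dev-component -- sh -c "$devfile_run_command" &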
@l0rd: PR needs rebase. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
* Introduce new 'pkg/remotecmd' package. This package allows executing commands in remote containers and exposes an interface for managing processes associated with given Devfile commands.
* Rely on 'pkg/libdevfile' as much as possible for Devfile command execution. This requires passing a handler on the odo side, which in turn uses the 'pkg/remotecmd' package to run commands in remote containers.
* Switch to running without Supervisord as PID 1 in containers. To do this, the idea is to start the container component: 1- using the command/args defined in the Devfile; 2- using whatever was defined in the container image if there is no command/args defined in the Devfile. Then, once the container is started, we would execute the Devfile commands directly in the container component, just like a simple 'kubectl exec' command would do. Since this is a long-running command (and potentially never-ending), we would need to run it in the background, i.e. in a side goroutine. Point 2) above requires implementing a temporary hack (as discussed in [1]), without us having to wait for [2] to be merged on the Devfile side. This temporary hack overrides the container entrypoint with "tail -f /dev/null" if the component defines no command or args (in which case we should have used whatever is defined in the image, per the specification). [1] #5768 (comment) [2] devfile/registry#102
* Rename K8s adapter struct 'client' field into 'kubeClient', as suggested in review
* Rename sync adapter struct 'client' fields to better distinguish between them
* Make sure messages displayed to users running 'odo dev' are the same
* Update temporary hack log message. Co-authored-by: Philippe Martin <[email protected]>
* Make sure to handle process output line by line, for performance purposes
* Handle remote process output and errors in the Devfile command handler. The implementation in kubeexec.go should remain as generic as possible.
* Keep retrying remote process status until timeout, rather than just waiting for 1 sec. Now that the command is run via a goroutine, there might be some situations where we were checking the status just before the goroutine had a chance to start.
* Handle remote process output and errors in the Devfile command handler. The implementation in kubeexec.go should remain as generic as possible.
* Update kubeexec StopProcessForCommand implementation such that it relies on /proc to kill the parent's children processes
* Ignore missing children file in getProcessChildren
* Unit-test methods in kubexec.go
* Fix missing logs when the build command does not pass when running 'odo dev'. Also add an integration test case.
* Fix spinner status when commands passed to exec_handler do not pass
* Make sure to check process status right after stopping it. The process just stopped might take longer to exit (it might have caught the signal and is performing additional cleanup).
* Keep retrying remote process status until timeout, rather than just waiting for 1 sec. Now that the command is run via a goroutine, there might be some situations where we were checking the status just before the goroutine had a chance to start.
* Fix potential deadlock when reading output from remotecmd#ExecuteCommandAndGetOutput. Rely on the same logic in ExecuteCommand.
* Add more unit tests
* Remove block that used to check debug port from env info. As commented out in [1], we no longer store the debug port value in the ENV file. [1] #5768 (comment)
* Rename 'getCommandFromFlag' into 'getCommandByName', as suggested in review
* Make remotecmd package more generic. This package no longer depends on Devfile-related packages.
* Fix comments in libdevfile.go
* Move errorIfTimeout struct field as a parameter of RetryWithSchedule. This boolean is tied to the given retry schedule, so it makes sense for it to be passed with the schedule.
* Expose a single ExecuteCommand function that returns both stdout and stderr
Co-authored-by: Philippe Martin <[email protected]>
It looks like the test failure is due to the recent issues with the OpenShift CI.
Tested the script and it works as expected. I've left a number of optional comments, feel free to ignore/defer to a different PR.
tests/check_non_terminating.sh
Outdated
if [ "${_int_command[*]}" == "null" ] && [ "${_int_command_args[*]}" == "null" ]; then
  echo " COMMAND: \"kubectl run test-terminating -n default --attach=false --restart=Never --image=$_int_image\""
  2>/dev/null 1>/dev/null kubectl run test-terminating -n default --attach=false --restart=Never --image="$_int_image"
Instead of using `2>/dev/null 1>/dev/null` it might be slightly cleaner to append `>/dev/null 2>&1`.
- 2>/dev/null 1>/dev/null kubectl run test-terminating -n default --attach=false --restart=Never --image="$_int_image"
+ kubectl run test-terminating -n default --attach=false --restart=Never --image="$_int_image" >/dev/null 2>&1
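(As an aside, the ordering only matters once 2>&1 is involved; a quick illustration:)

ls /nonexistent 2>&1 >/dev/null   # stderr was pointed at the terminal before stdout was redirected, so the error still prints
ls /nonexistent >/dev/null 2>&1   # both streams end up in /dev/null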
Makes sense, I will update that.
tests/check_non_terminating.sh
Outdated
  2>/dev/null 1>/dev/null kubectl run test-terminating -n default --attach=false --restart=Never --image="$_int_image" --command=true -- ${_int_command[*]} ${_int_command_args[*]}
fi

if 2>/dev/null 1>/dev/null kubectl wait pods -n ${namespace} test-terminating --for condition=Ready --timeout=${timeout_in_sec}s; then
The run commands hard-code the namespace (`default`) whereas the variable is used here.
Thanks, I will fix that
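The fix is essentially to reuse the variable in the run/delete commands as well, e.g. (a sketch of the corrected line):

kubectl run test-terminating -n "${namespace}" --attach=false --restart=Never --image="$_int_image" >/dev/null 2>&1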
tests/check_non_terminating.sh
Outdated
  2>/dev/null 1>/dev/null kubectl run test-terminating -n default --attach=false --restart=Never --image="$_int_image" --command=true -- ${_int_command[*]} ${_int_command_args[*]}
fi

if 2>/dev/null 1>/dev/null kubectl wait pods -n ${namespace} test-terminating --for condition=Ready --timeout=${timeout_in_sec}s; then
There's some fragility in this check -- a pod may briefly enter the Running state even if it is a terminating container, if e.g. it runs some entrypoint that does not resolve immediately. However, for now this is likely sufficient.
I'm able to trick this test (most of the time, at least, on OpenShift) with
kubectl run test-terminating \
--attach=false \
--restart=Never \
--image="registry.access.redhat.com/ubi8/nodejs-16:latest" \
-- sleep 0.5s \
&& kubectl get po -w
but this could naturally arise if the pod attempts to start a server or similar.
For the future, perhaps a safer check would be to e.g. start a pod for each image in devfiles in parallel, then wait some period of time and verify that the pods still have the Ready condition.
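A sketch of that alternative (hypothetical, not part of this PR; it assumes an images array and the namespace variable from the script):

i=0
for image in "${images[@]}"; do
  kubectl run "test-terminating-$i" -n "${namespace}" --attach=false --restart=Never --image="$image" >/dev/null 2>&1
  i=$((i + 1))
done
sleep 60   # give slow or briefly-Running entrypoints time to exit
kubectl wait pods --all -n "${namespace}" --for condition=Ready --timeout=10s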
You are right. But it works with current stacks so I think that's acceptable for now.
tests/check_non_terminating.sh
Outdated
fi
}

YQ_PATH=yq
Not sure if this is left over from an earlier refactor -- should this be something like:
- YQ_PATH=yq
+ YQ_PATH=${YQ_PATH:-yq}
Also, this would be very useful to have, as this script uses mikefarah/yq whereas my default installed binary is the kislyuk/yq version. To use the script I would need to export YQ_PATH=...
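With that default in place, a different binary could be supplied without editing the script, e.g. (hypothetical path):

YQ_PATH=/usr/local/bin/mikefarah-yq ./tests/check_non_terminating.sh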
👍
tests/check_non_terminating.sh
Outdated
if 2>/dev/null 1>/dev/null kubectl wait pods -n ${namespace} test-terminating --for condition=Ready --timeout=${timeout_in_sec}s; then
  echo " SUCCESS: The container started successfully and didn't terminate"
  2>/dev/null 1>/dev/null kubectl delete pod --force test-terminating -n default
Using the `--force` argument, I get the warning:
warning: Immediate deletion does not wait for confirmation that the running resource has been terminated. The resource may continue to run on the cluster indefinitely.
If the attempt is to make the script run more quickly, perhaps naming the containers after the image being tested and passing `--wait=false` would be safer.
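That would make the cleanup look something like this (a sketch):

kubectl delete pod test-terminating -n "${namespace}" --wait=false   # no --force: deletion proceeds gracefully, --wait=false returns immediately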
You are right, I will remove --force
tests/check_non_terminating.sh
Outdated
_int_command=("$2")
_int_command_args=("$3")

namespace=default
Creating a namespace specifically for the test might make cleanup easier, as you could just `oc delete project test-terminating` to make sure everything is cleaned up. The `default` namespace can have unexpected restrictions that are not present in non-default namespaces, and would prevent someone from running this test in a cluster where they are not cluster-admin.
Yes, I will make it configurable (i.e. namespace=${TEST_NAMESPACE:-default}).
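A sketch of what that could look like in the script (assuming permissions to create namespaces on the test cluster):

namespace="${TEST_NAMESPACE:-default}"
if [ "$namespace" != "default" ]; then
  kubectl create namespace "$namespace" >/dev/null 2>&1 || true   # ignore "already exists"
fi
# ... run the checks ...
if [ "$namespace" != "default" ]; then
  kubectl delete namespace "$namespace" --wait=false
fi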
@amisevsk thanks for the review, I will address the comments later today.
Sorry, just saw this. Not sure what's going on with OpenShift CI; it seems like our test script isn't even getting executed. I've followed up with QE to see what's going on.
699bdc9 to 7ac5532
@amisevsk I should have addressed most of your points in the last 4 commits, please have a look.
@l0rd: The following test failed, say
Full PR test history. Your PR dashboard. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.
LGTM 👍
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: amisevsk, l0rd The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing
Container components should be non-terminating (cf. point n.1 here).
This PR adds a script to check that container components are non-terminating (./tests/check_non_terminating.sh, used as a PR check too). It also patches the registry devfiles that failed to pass the test with:
container:
  image: (...)
+ args: ["tail", "-f", "/dev/null"]
or
container:
  image: (...)
+ command: ["tail", "-f", "/dev/null"]
This is the list of patched devfiles:
Which issue(s) this PR fixes:
devfile/api#681
PR acceptance criteria:
How to test changes / Special notes to the reviewer:
I have added the following lines to the repo README file, in the "Running the tests" section: