Fix terminating containers #102
Conversation
Skipping CI for Draft Pull Request.
/test all
It looks like the changes aren't playing nicely with odo, e.g. all of the stacks are failing with the same error.
stacks/dotnet50/devfile.yaml
Outdated
@@ -25,6 +25,7 @@ components:
   - name: dotnet
     container:
       image: registry.access.redhat.com/ubi8/dotnet-50:5.0
+      command: ["tail", "-f", "/dev/null"]
Generally, the safer approach is to override `args` rather than `command`, in order to allow any entrypoint/setup script to run. This will generally work, as most containers have an entrypoint along the lines of:
#!/bin/bash
# do necessary setup, start daemons, etc.
exec "$@"
For the dotnet image here, for example, the default command is:
#!/bin/bash
set -e
source /opt/app-root/etc/generate_container_user
# .NET uses libcurl for HTTP handling. libcurl doesn't use 'HTTP_PROXY', but uses 'http_proxy'.
# libcurl is not using the upper-case to avoid exploiting httpoxy in CGI-like environments (https://httpoxy.org/).
# CGI-like environments must be fixed for this exploit (https://access.redhat.com/security/vulnerabilities/httpoxy).
# In an OpenShift context, it is common to use the upper-case 'HTTP_PROXY'.
if [ -z "$http_proxy" ] && [ ! -z "$HTTP_PROXY" ]; then
export http_proxy="${HTTP_PROXY}"
fi
# Trust certificates from DOTNET_SSL_DIRS.
if [ -n "$DOTNET_SSL_DIRS" ]; then
# The main process (PID 1) sets up a certificate folder. The other processes use it.
if [ $$ -eq 1 ]; then
source /opt/app-root/etc/trust_ssl_dirs
else
export SSL_CERT_DIR="$DOTNET_SSL_CERT_DIR"
fi
fi
cmd="$1"; shift
exec $cmd "$@"
The only time setting `args` instead of `command` will fail is when the default entrypoint for an image ignores args, but that case should be rare.
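In Kubernetes terms, the devfile `command` maps to the container entrypoint and `args` to its arguments, so the difference can be reproduced locally with docker (a rough sketch, not part of this PR; it assumes docker is available and uses the dotnet image discussed above):

# args-style override: the image entrypoint still runs its setup and then execs "tail -f /dev/null"
docker run --rm registry.access.redhat.com/ubi8/dotnet-50:5.0 tail -f /dev/null

# command-style override: replacing the entrypoint skips the setup script entirely
docker run --rm --entrypoint tail registry.access.redhat.com/ubi8/dotnet-50:5.0 -f /dev/null

Both containers stay up; the difference is only whether the setup script in container-entrypoint gets a chance to run.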
This pattern is also widely used in the old Che devfile v1 images, e.g. https://github.com/eclipse-che/che-devfile-registry/blob/main/dockerfiles/entrypoint.sh
@amisevsk thanks I will test that.
After using `args` instead of `command`, these stacks are still failing:
nodejs-nextjs nodejs java-quarkus php-laravel go nodejs-nuxtjs nodejs-vue python java-websphereliberty dotnet50 nodejs-react dotnet60 nodejs-svelte java-websphereliberty-gradle java-openliberty-gradle java-openliberty nodejs-angular dotnetcore31
Here's a summary of the entrypoint for each image:
STACK NAME | IMAGE | ENTRYPOINT |
---|---|---|
nodejs-nextjs | node:lts-slim | ["docker-entrypoint.sh"] |
nodejs | registry.access.redhat.com/ubi8/nodejs-14:latest | ["container-entrypoint"] |
java-quarkus | registry.access.redhat.com/ubi8/openjdk-11 | null |
php-laravel | composer:2.1.11 | ["/docker-entrypoint.sh"] |
go | golang:latest | null |
nodejs-nuxtjs | node:lts | ["docker-entrypoint.sh"] |
nodejs-vue | node:lts-slim | ["docker-entrypoint.sh"] |
python | quay.io/eclipse/che-python-3.7:nightly | ["/entrypoint.sh"] |
java-websphereliberty | icr.io/appcafe/websphere-liberty-devfile-stack:22.0.0.1 | null |
dotnet50 | registry.access.redhat.com/ubi8/dotnet-50:5.0 | ["container-entrypoint"] |
nodejs-react | node:lts-slim | ["docker-entrypoint.sh"] |
dotnet60 | registry.access.redhat.com/ubi8/dotnet-60:6.0 | ["container-entrypoint"] |
nodejs-svelte | node:lts-slim | ["docker-entrypoint.sh"] |
java-websphereliberty-gradle | icr.io/appcafe/websphere-liberty-devfile-stack:22.0.0.1-gradle | null |
java-openliberty-gradle | icr.io/appcafe/open-liberty-devfile-stack:22.0.0.1-gradle | null |
java-openliberty | icr.io/appcafe/open-liberty-devfile-stack:22.0.0.1 | null |
nodejs-angular | node:lts-slim | ["docker-entrypoint.sh"] |
dotnetcore31 | registry.access.redhat.com/ubi8/dotnet-31:3.1 | ["container-entrypoint"] |
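(For reference, a table like this can be reproduced locally with something along these lines -- a sketch that assumes docker is installed and only lists a few of the images above:)

for image in node:lts-slim registry.access.redhat.com/ubi8/dotnet-50:5.0 golang:latest; do
  docker pull --quiet "$image" >/dev/null
  printf '%s -> ' "$image"
  docker inspect --format '{{json .Config.Entrypoint}}' "$image"   # prints e.g. ["docker-entrypoint.sh"] or null
done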
Checking a few of these, it seems they should be fine if `args` is set to ["tail", "-f", "/dev/null"] and `command` is left unchanged (e.g. the node:lts and node:lts-slim images use an entrypoint that should just exec the args) -- running `docker run -it --rm $image tail -f /dev/null` does result in a non-terminating container.
I don't know if odo is doing anything fancy for these images, but if it was always just overriding the command/args then it should work, as far as I can tell.
/retest
CI now passing with odo v2.5.1 👍
@l0rd Can this be moved out of Draft / WIP now?
No, please don't merge this yet. The majority of users still use 2.5.0 and, more importantly, the OpenShift connector (VS Code plugin) is still based on 2.5.0. If you merge this now, it will break them.
@kadel I will wait for your approval to merge this
Some Devfiles in the default registry define terminating container components. As such, they cannot be used as-is without overriding container entrypoints with Supervisord. In other words, those cases cannot be tested with the `--no-supervisord` flag. Note: we would need to revert this once [1] is merged into devfile/registry. [1] devfile/registry#102
As discussed in [1], this will alleviate the maintenance burden, while waiting for [2] to get merged some day on the Devfile side. [1] redhat-developer#5768 (comment) [2] devfile/registry#102
This temporary hack overrides the container entrypoint with "tail -f /dev/null" if the component defines no command or args (in which case we should have used whatever is defined in the Devfile, per the specification). As discussed in [1], this allows us to get rid of Supervisord right away without us having to wait until [2] is merged on the Devfile side. [1] redhat-developer#5768 (comment) [2] devfile/registry#102
To do this, the idea is to start the container component:
1- using the command/args defined in the Devfile
2- using whatever was defined in the container image if there is no command/args defined in the Devfile
Then, once the container is started, we would execute the Devfile commands directly in the container component, just like a simple 'kubectl exec' command would do. Since this is a long-running command (and potentially never-ending), we would need to run it in the background, i.e. in a side goroutine.
Point 2) above requires implementing a temporary hack (as discussed in [1]), without us having to wait for [2] to be merged on the Devfile side. This temporary hack overrides the container entrypoint with "tail -f /dev/null" if the component defines no command or args (in which case we should have used whatever is defined in the image, per the specification).
[1] redhat-developer#5768 (comment) [2] devfile/registry#102
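(A rough kubectl-level analogue of the flow described in these commit messages -- a sketch only, since odo implements this in Go rather than via kubectl; $image and $devfile_run_command are placeholders:)

# 1) start the component; the "tail -f /dev/null" hack applies only when the devfile defines no command/args
kubectl run dev-component --image="$image" --command -- tail -f /dev/null
kubectl wait pod dev-component --for condition=Ready --timeout=60s
# 2) run the (potentially never-ending) devfile command inside it, in the background
kubectl exec dev-component -- sh -c "$devfile_run_command" &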
@l0rd: PR needs rebase. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
* Introduce new 'pkg/remotecmd' package. This package allows executing commands in remote containers and exposes an interface for managing processes associated with given Devfile commands.
* Rely on 'pkg/libdevfile' as much as possible for Devfile command execution. This requires passing a handler on the odo side, which in turn uses the 'pkg/remotecmd' package to run commands in remote containers.
* Switch to running without Supervisord as PID 1 in containers. To do this, the idea is to start the container component: 1- using the command/args defined in the Devfile; 2- using whatever was defined in the container image if there is no command/args defined in the Devfile. Then, once the container is started, we would execute the Devfile commands directly in the container component, just like a simple 'kubectl exec' command would do. Since this is a long-running command (and potentially never-ending), we would need to run it in the background, i.e. in a side goroutine. Point 2) above requires implementing a temporary hack (as discussed in [1]), without us having to wait for [2] to be merged on the Devfile side. This temporary hack overrides the container entrypoint with "tail -f /dev/null" if the component defines no command or args (in which case we should have used whatever is defined in the image, per the specification). [1] #5768 (comment) [2] devfile/registry#102
* Rename K8s adapter struct 'client' field into 'kubeClient', as suggested in review
* Rename sync adapter struct 'client' fields to better distinguish between them
* Make sure messages displayed to users running 'odo dev' are the same
* Update temporary hack log message. Co-authored-by: Philippe Martin <[email protected]>
* Make sure to handle process output line by line, for performance purposes
* Handle remote process output and errors in the Devfile command handler. The implementation in kubeexec.go should remain as generic as possible.
* Keep retrying remote process status until timeout, rather than just waiting for 1 sec. Now that the command is run via a goroutine, there might be some situations where we were checking the status just before the goroutine had a chance to start.
* Handle remote process output and errors in the Devfile command handler. The implementation in kubeexec.go should remain as generic as possible.
* Update kubeexec StopProcessForCommand implementation such that it relies on /proc to kill the parent's children processes
* Ignore missing children file in getProcessChildren
* Unit-test methods in kubexec.go
* Fix missing logs when the build command does not pass when running 'odo dev'. Also add an integration test case.
* Fix spinner status when commands passed to exec_handler do not pass
* Make sure to check process status right after stopping it. The process just stopped might take longer to exit (it might have caught the signal and is performing additional cleanup).
* Keep retrying remote process status until timeout, rather than just waiting for 1 sec. Now that the command is run via a goroutine, there might be some situations where we were checking the status just before the goroutine had a chance to start.
* Fix potential deadlock when reading output from remotecmd#ExecuteCommandAndGetOutput. Rely on the same logic in ExecuteCommand.
* Add more unit tests
* Remove block that used to check debug port from env info. As commented out in [1], we no longer store the debug port value in the ENV file. [1] #5768 (comment)
* Rename 'getCommandFromFlag' into 'getCommandByName', as suggested in review
* Make remotecmd package more generic. This package no longer depends on Devfile-related packages.
* Fix comments in libdevfile.go
* Move errorIfTimeout struct field as a parameter of RetryWithSchedule. This boolean is tied to the given retry schedule, so it makes sense for it to be passed with the schedule.
* Expose a single ExecuteCommand function that returns both stdout and stderr
Co-authored-by: Philippe Martin <[email protected]>
It looks like the test failure is due to the recent issues with the OpenShift CI.
Tested the script and it works as expected. I've left a number of optional comments, feel free to ignore/defer to a different PR.
tests/check_non_terminating.sh
Outdated
if [ "${_int_command[*]}" == "null" ] && [ "${_int_command_args[*]}" == "null" ]; then
  echo " COMMAND: \"kubectl run test-terminating -n default --attach=false --restart=Never --image=$_int_image\""
  2>/dev/null 1>/dev/null kubectl run test-terminating -n default --attach=false --restart=Never --image="$_int_image"
Instead of using `2>/dev/null 1>/dev/null` it might be slightly cleaner to append `>/dev/null 2>&1`.
- 2>/dev/null 1>/dev/null kubectl run test-terminating -n default --attach=false --restart=Never --image="$_int_image"
+ kubectl run test-terminating -n default --attach=false --restart=Never --image="$_int_image" >/dev/null 2>&1
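(As an aside, the ordering only matters once 2>&1 is involved; a quick illustration:)

ls /nonexistent 2>&1 >/dev/null   # stderr was pointed at the terminal before stdout was redirected, so the error still prints
ls /nonexistent >/dev/null 2>&1   # both streams end up in /dev/null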
Makes sense, I will update that.
tests/check_non_terminating.sh
Outdated
  2>/dev/null 1>/dev/null kubectl run test-terminating -n default --attach=false --restart=Never --image="$_int_image" --command=true -- ${_int_command[*]} ${_int_command_args[*]}
fi

if 2>/dev/null 1>/dev/null kubectl wait pods -n ${namespace} test-terminating --for condition=Ready --timeout=${timeout_in_sec}s; then
The run commands hard-code the namespace (`default`) whereas the variable is used here.
Thanks, I will fix that
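The fix is essentially to reuse the variable in the run/delete commands as well, e.g. (a sketch of the corrected line):

kubectl run test-terminating -n "${namespace}" --attach=false --restart=Never --image="$_int_image" >/dev/null 2>&1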
tests/check_non_terminating.sh
Outdated
  2>/dev/null 1>/dev/null kubectl run test-terminating -n default --attach=false --restart=Never --image="$_int_image" --command=true -- ${_int_command[*]} ${_int_command_args[*]}
fi

if 2>/dev/null 1>/dev/null kubectl wait pods -n ${namespace} test-terminating --for condition=Ready --timeout=${timeout_in_sec}s; then
There's some fragility in this check -- a pod may briefly enter the Running state even if it is a terminating container, if e.g. it runs some entrypoint that does not resolve immediately. However, for now this is likely sufficient.
I'm able to trick this test (most of the time, at least, on OpenShift) with
kubectl run test-terminating \
--attach=false \
--restart=Never \
--image="registry.access.redhat.com/ubi8/nodejs-16:latest" \
-- sleep 0.5s \
&& kubectl get po -w
but this could naturally arise if the pod attempts to start a server or similar.
For the future, perhaps a safer check would be to e.g. start a pod for each image in devfiles in parallel, then wait some period of time and verify that the pods still have the Ready condition.
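A sketch of that alternative (hypothetical, not part of this PR; it assumes an images array and the namespace variable from the script):

i=0
for image in "${images[@]}"; do
  kubectl run "test-terminating-$i" -n "${namespace}" --attach=false --restart=Never --image="$image" >/dev/null 2>&1
  i=$((i + 1))
done
sleep 60   # give slow or briefly-Running entrypoints time to exit
kubectl wait pods --all -n "${namespace}" --for condition=Ready --timeout=10s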
You are right. But it works with current stacks so I think that's acceptable for now.
tests/check_non_terminating.sh
Outdated
fi
}

YQ_PATH=yq
Not sure if this is left over from an earlier refactor -- should this be something like:
- YQ_PATH=yq
+ YQ_PATH=${YQ_PATH:-yq}
Also, this would be very useful to have, as this script uses mikefarah/yq whereas my default installed binary is the kislyuk/yq version. To use the script I would need to export YQ_PATH=...
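With that default in place, a different binary could be supplied without editing the script, e.g. (hypothetical path):

YQ_PATH=/usr/local/bin/mikefarah-yq ./tests/check_non_terminating.sh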
👍
tests/check_non_terminating.sh
Outdated
if 2>/dev/null 1>/dev/null kubectl wait pods -n ${namespace} test-terminating --for condition=Ready --timeout=${timeout_in_sec}s; then
  echo " SUCCESS: The container started successfully and didn't terminate"
  2>/dev/null 1>/dev/null kubectl delete pod --force test-terminating -n default
Using the `--force` argument, I get the warning:
warning: Immediate deletion does not wait for confirmation that the running resource has been terminated. The resource may continue to run on the cluster indefinitely.
If the attempt is to make the script run more quickly, perhaps naming the containers after the image being tested and passing `--wait=false` would be safer.
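That would make the cleanup look something like this (a sketch):

kubectl delete pod test-terminating -n "${namespace}" --wait=false   # no --force: deletion proceeds gracefully, --wait=false returns immediately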
You are right, I will remove --force
tests/check_non_terminating.sh
Outdated
_int_command=("$2")
_int_command_args=("$3")

namespace=default
Creating a namespace specifically for the test might make cleanup easier, as you could just `oc delete project test-terminating` to make sure everything is cleaned up. The `default` namespace can have unexpected restrictions that are not present in non-default namespaces, and would prevent someone from running this test in a cluster where they are not cluster-admin.
Yes, I will make it configurable (i.e. namespace=${TEST_NAMESPACE:-default}).
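A sketch of what that could look like in the script (assuming permissions to create namespaces on the test cluster):

namespace="${TEST_NAMESPACE:-default}"
if [ "$namespace" != "default" ]; then
  kubectl create namespace "$namespace" >/dev/null 2>&1 || true   # ignore "already exists"
fi
# ... run the checks ...
if [ "$namespace" != "default" ]; then
  kubectl delete namespace "$namespace" --wait=false
fi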
@amisevsk thanks for the review, I will address the comments later today.
Sorry, just saw this. Not sure what's going on with OpenShift CI; it seems like our test script isn't even getting executed. I've followed up with QE to see what's going on.
699bdc9 to 7ac5532
@amisevsk I should have addressed most of your points in the last 4 commits, please have a look.
@l0rd: The following test failed, say
Full PR test history. Your PR dashboard. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.
LGTM 👍
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: amisevsk, l0rd The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing
Container components should be non-terminating (cf. point n.1 here).
This PR adds a script to check that container components are non-terminating (./tests/check_non_terminating.sh, used as a PR check too). It also patches the registry devfiles that failed to pass the test with:
container:
  image: (...)
+ args: ["tail", "-f", "/dev/null"]
or
container:
  image: (...)
+ command: ["tail", "-f", "/dev/null"]
This is the list of patched devfiles:
Which issue(s) this PR fixes:
devfile/api#681
PR acceptance criteria:
How to test changes / Special notes to the reviewer:
I have added the following lines to the repo README file, in the "Running the tests" section: