|
1 |
| -#Support "no new privileges" in Kubernetes |
| 1 | +# No New Privileges |
2 | 2 |
|
3 |
| -##Description |
| 3 | +- [Description](#description) |
| 4 | + * [Interactions with other Linux primitives](#interactions-with-other-linux-primitives) |
| 5 | +- [Current Implementations](#current-implementations) |
| 6 | + * [Support in Docker](#support-in-docker) |
| 7 | + * [Support in rkt](#support-in-rkt) |
| 8 | + * [Support in OCI runtimes](#support-in-oci-runtimes) |
| 9 | +- [Existing SecurityContext objects](#existing-securitycontext-objects) |
| 10 | +- [Changes of SecurityContext objects](#changes-of-securitycontext-objects) |
4 | 11 |
|
5 |
| -In Linux, the `execve` system call can grant more privileges to a newly-created process than its parent process. Considering security issues, since Linux kernel v3.5, there is a new flag named `no_new_privs` added to prevent those new privileges from being granted to the processes. |
| 12 | +## Description |
6 | 13 |
|
7 |
| -`no_new_privs` is inherited across `fork`, `clone` and `execve` and can not be unset. With `no_new_privs` set, `execve` promises not to grant the privilege to do anything that could not have been done without the `execve` call. |
| 14 | +In Linux, the `execve` system call can grant a newly created process more
| 15 | +privileges than its parent process has. To address the resulting security
| 16 | +issues, Linux kernel v3.5 added a flag named `no_new_privs` that prevents
| 17 | +those new privileges from being granted.
8 | 18 |
|
9 |
| -For more details about `no_new_privs`, please check the Linux kernel document [here](https://www.kernel.org/doc/Documentation/prctl/no_new_privs.txt). |
| 19 | +[`no_new_privs`](https://www.kernel.org/doc/Documentation/prctl/no_new_privs.txt) |
| 20 | +is inherited across `fork`, `clone` and `execve` and cannot be unset. With
| 21 | +`no_new_privs` set, `execve` promises not to grant the privilege to do anything |
| 22 | +that could not have been done without the `execve` call. |
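
As a minimal sketch of how a process could set and inspect this flag, the Python snippet below calls `prctl(2)` through `ctypes` and parses the `NoNewPrivs` field of `/proc/<pid>/status` (present since Linux 4.10). The helper names `set_no_new_privs` and `parse_no_new_privs` are hypothetical and exist only for illustration; the numeric constants are the values from `<linux/prctl.h>`.

```python
import ctypes
import os

# Option numbers from <linux/prctl.h>.
PR_SET_NO_NEW_PRIVS = 38


def set_no_new_privs() -> None:
    """Set no_new_privs on the calling thread (Linux only; cannot be undone)."""
    libc = ctypes.CDLL(None, use_errno=True)
    zero = ctypes.c_ulong(0)
    if libc.prctl(PR_SET_NO_NEW_PRIVS, ctypes.c_ulong(1), zero, zero, zero) != 0:
        errno = ctypes.get_errno()
        raise OSError(errno, os.strerror(errno))


def parse_no_new_privs(status_text: str) -> bool:
    """Read the NoNewPrivs field from the text of /proc/<pid>/status."""
    for line in status_text.splitlines():
        key, _, value = line.partition(":")
        if key == "NoNewPrivs":
            return value.strip() == "1"
    raise ValueError("NoNewPrivs field not found")
```

Because the flag is inherited and cannot be cleared, a caller would typically invoke `set_no_new_privs()` once, immediately before `execve`, and could confirm the result by passing the contents of `/proc/self/status` to `parse_no_new_privs`.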
10 | 23 |
|
11 |
| -Docker started to support `no_new_privs` option since 1.11. Here is the [link](https://github.com/docker/docker/issues/20329) of the ticket in Docker community to support `no_new_privs` option. |
| 24 | +For more details about `no_new_privs`, please check the |
| 25 | +[Linux kernel documentation](https://www.kernel.org/doc/Documentation/prctl/no_new_privs.txt).
12 | 26 |
|
13 |
| -We want to support the creation of containers with `no_new_privs` enabled in Kubernetes, which will make the Kubernetes cluster more safe. Here is the [link](https://github.com/kubernetes/kubernetes/issues/38417) of the ticket in Kubernetes community to track this proposal. |
| 27 | +This is different from `NOSUID` in that `no_new_privs` can give permission to
| 28 | +the container process to further restrict child processes with seccomp. This
| 29 | +permission goes only one way: the container process cannot grant more
| 30 | +permissions, only restrict further.
14 | 31 |
|
| 32 | +### Interactions with other Linux primitives |
15 | 33 |
|
16 |
| -##Current implementation |
| 34 | +- suid binaries: will break when `no_new_privs` is enabled |
| 35 | +- seccomp2 as a non root user: requires `no_new_privs` |
| 36 | +- seccomp2 with dropped `CAP_SYS_ADMIN`: requires `no_new_privs` |
| 37 | +- ambient capabilities: requires `no_new_privs` |
| 38 | +- SELinux transitions: bugs that were fixed are documented [here](https://github.com/moby/moby/issues/23981#issuecomment-233121969)
17 | 39 |
|
18 |
| -###Support in Docker |
19 | 40 |
|
20 |
| -Since Docker 1.11, user can specify `--security-opt` to enable `no_new_privs` while creating containers, e.g. `docker run --security-opt=no-new-privileges busybox` |
| 41 | +## Current Implementations |
21 | 42 |
|
22 |
| -For program client, Docker provides an object named `ContainerCreateConfig` defined in package `github.com/docker/engine-api/types` to config container creation parameters. In this object, there is a string array `HostConfig.SecurityOpt` to specify the security options. Client can utilize this field to specify the arguments for security options while creating new containers. |
| 43 | +### Support in Docker |
23 | 44 |
|
24 |
| -###Support in OCI runtimes |
| 45 | +Since Docker 1.11, a user can specify `--security-opt` to enable `no_new_privs` |
| 46 | +while creating containers, for example |
| 47 | +`docker run --security-opt=no-new-privileges busybox`.
25 | 48 |
|
26 |
| -Since version 0.3.0 of the OCI runtime specification, a user can specify the `noNewPrivs` boolean flag in the configuration file. |
| 49 | +Docker provides, via its Go API, an object named `ContainerCreateConfig` to
| 50 | +configure container creation parameters. This object has a string array,
| 51 | +`HostConfig.SecurityOpt`, that holds the security options. Clients can use
| 52 | +this field to pass security option arguments when creating new
| 53 | +containers.
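
To illustrate the shape of that field, the sketch below adds the `no-new-privileges` option string (the value accepted by `--security-opt`) to a container-create request body. It assumes the body is a plain dictionary whose `HostConfig` and `SecurityOpt` keys mirror the Go field names; the helper `with_no_new_privileges` is hypothetical and is not part of any Docker client library.

```python
import copy


def with_no_new_privileges(create_body: dict) -> dict:
    """Return a copy of a create body with no-new-privileges in SecurityOpt."""
    body = copy.deepcopy(create_body)
    host_config = body.setdefault("HostConfig", {})
    # Treat a missing or null SecurityOpt as an empty list.
    security_opt = list(host_config.get("SecurityOpt") or [])
    if "no-new-privileges" not in security_opt:
        security_opt.append("no-new-privileges")
    host_config["SecurityOpt"] = security_opt
    return body


request = with_no_new_privileges({"Image": "busybox"})
```

Every security option is encoded as a free-form string in one list, so callers have to know and format each string themselves.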
27 | 54 |
|
28 |
| -More details of OCI implementation can be checked [here](https://github.com/opencontainers/runtime-spec/pull/290). |
| 55 | +This field did not scale well for the Docker client, so it's suggested that |
| 56 | +Kubernetes does not follow that design. |
29 | 57 |
|
30 |
| -###SecurityContext in Kubernetes |
| 58 | +More details of the Docker implementation can be read |
| 59 | +[here](https://github.com/moby/moby/pull/20727) as well as the original |
| 60 | +discussion [here](https://github.com/moby/moby/issues/20329). |
31 | 61 |
|
32 |
| -Kubernetes defines `SecurityContext` for `Container` and `PodSecurityContext` for `PodSpec`. `SecurityContext` objects define the related security options for Kubernetes containers, e.g. selinux options. |
| 62 | +### Support in rkt |
33 | 63 |
|
34 |
| -While creating a container, kubelet parses the security context object and formats the security option strings for Docker. The security options strings will finally be inserted into `ContainerCreateConfig.HostConfig.SecurityOpt` and passed to Docker. Different Kubernetes runtimes now are using different methods to parse and format the security option strings: |
35 |
| -* method `#getSecurityOpts` in `docker_mager_xxxx.go` for Docker runtime |
36 |
| -* method `#getContainerSecurityOpts` in `docker_container.go` for CRI |
| 64 | +Since rkt v1.26.0, the `NoNewPrivileges` option has been enabled in rkt. |
37 | 65 |
|
| 66 | +More details of the rkt implementation can be read |
| 67 | +[here](https://github.com/rkt/rkt/pull/2677). |
38 | 68 |
|
39 |
| -##Proposal to support "no new privileges" |
| 69 | +### Support in OCI runtimes |
40 | 70 |
|
41 |
| -To support "no new privileges" options in Kubernetes, it is proposed to make the following changes: |
| 71 | +Since version 0.3.0 of the OCI runtime specification, a user can specify the |
| 72 | +`noNewPrivs` boolean flag in the configuration file. |
42 | 73 |
|
43 |
| -###Changes of SecurityContext objects |
| 74 | +More details of the OCI implementation can be read |
| 75 | +[here](https://github.com/opencontainers/runtime-spec/pull/290). |
44 | 76 |
|
45 |
| -Add a new bool type field named `noNewPrivileges` to both `SecurityContext` definition and `PodSecurityContext` definition: |
46 |
| -* `noNewPrivileges=true` in `PodSecurityContext` means that all the containers in the pod should be run with `no-new-privileges` enabled. This should be a pod level control of `no-new-privileges` flag. |
47 |
| -* `noNewPrivileges` in `SecurityContext` is a container level control of `no-new-privileges` flag, and can override the pod level `noNewPrivileges` setting. |
| 77 | +## Existing SecurityContext objects |
48 | 78 |
|
49 |
| -By default, `noNewPrivileges` is `false`. |
| 79 | +Kubernetes defines `SecurityContext` for `Container` and `PodSecurityContext` |
| 80 | +for `PodSpec`. `SecurityContext` objects define the related security options |
| 81 | +for Kubernetes containers, e.g. selinux options. |
50 | 82 |
|
51 |
| -The change of security context API objects requires the update of corresponding Kubernetes documents, need to submit another PR to track this. |
| 83 | +To support "no new privileges" options in Kubernetes, it is proposed to make |
| 84 | +the following changes: |
52 | 85 |
|
53 |
| -###Changes of docker runtime |
| 86 | +## Changes of SecurityContext objects |
54 | 87 |
|
55 |
| -When parsing the new `SecurityContext` object, kubelet has to take care of `noNewPrivileges` field from security context objects. Once `noNewPrivileges` is `true`, kubelet needs to change `#getSecurityOpts` method in `docker_manager_xxx.go` to add `no-new-privileges` option to `ContainerCreateConfig.HostConfig.SecurityOpt` |
| 88 | +Add a new bool type field named `allowPrivilegeEscalation` to both `SecurityContext` |
| 89 | +definition and `PodSecurityContext` definition: |
56 | 90 |
|
57 |
| -###Changes of CRI runtime |
| 91 | +* `allowPrivilegeEscalation=true` in `PodSecurityContext` means that all the |
| 92 | +containers in the pod should **NOT** be run with `no_new_privs` enabled. |
| 93 | +This enables pod level control of the `no_new_privs` flag.
| 94 | +* `allowPrivilegeEscalation` in `SecurityContext` provides container level |
| 95 | +control of the `no_new_privs` flag and can override the pod level |
| 96 | +`allowPrivilegeEscalation` setting. |
58 | 97 |
|
59 |
| -When parsing the new `SecurityContext` object, kubelet has to take care of `noNewPrivileges` field from security context objects. Once `noNewPrivileges` is `true`, kubelet needs to change `#getContainerSecurityOpts` method in `docker_container.go` to add `no-new-privileges` option to `ContainerCreateConfig.HostConfig.SecurityOpt` |
| 98 | +By default, `allowPrivilegeEscalation` is `false` with the following exceptions: |
60 | 99 |
|
61 |
| -###Changes of kubectl |
| 100 | +- when a container is `privileged` |
| 101 | +- when `CAP_SYS_ADMIN` is added to a container |
| 102 | +- when a container is not run as root, uid `0` (to prevent breaking suid |
| 103 | + binaries) |
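
The precedence and default rules above can be sketched as a small function. This is an illustrative model only, not kubelet code: the function and parameter names are hypothetical, and two points the proposal leaves open are handled by assumption here (the exceptions only change the default, never an explicit setting, and an unspecified user ID does not count as non-root).

```python
from typing import Optional, Sequence


def allow_privilege_escalation(
    pod_setting: Optional[bool],
    container_setting: Optional[bool],
    privileged: bool = False,
    added_capabilities: Sequence[str] = (),
    run_as_user: Optional[int] = None,
) -> bool:
    """Resolve the effective allowPrivilegeEscalation value for one container."""
    # The container level setting overrides the pod level setting.
    if container_setting is not None:
        return container_setting
    if pod_setting is not None:
        return pod_setting
    # Nothing set explicitly: default to false, except in the listed cases.
    if privileged:
        return True
    if "CAP_SYS_ADMIN" in added_capabilities:
        return True
    # Assumption: only an explicitly configured non-zero uid counts as non-root.
    if run_as_user is not None and run_as_user != 0:
        return True
    return False


# The runtime would set no_new_privs exactly when escalation is not allowed.
set_no_new_privs = not allow_privilege_escalation(None, None, run_as_user=0)
```

In this model a root container with no explicit setting ends up with `no_new_privs` set, while a non-root container keeps working suid binaries unless it opts in by setting `allowPrivilegeEscalation` to `false`.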
62 | 104 |
|
63 |
| -This is an additional proposal for kubectl. To improve kubectl user experience, we can add a new flag for kubectl command named `--security-opt`. This flag allows user to create pod with security options configured when using `kubectl run` command. For example, if user issues command like `kubectl run busybox --image=busybox --security-opt=no-new-privileges -- top`, kubernetes shall create a pod with `noNewPrivileges` enabled. |
64 |
| - |
65 |
| -If the proposal of kubectl changes is accepted, the patch can also be submitted as a separate PR. |
| 105 | +This requires changes to the Docker, rkt, and CRI runtime integrations so that |
| 106 | +kubelet will add the specific `no_new_privs` option. |