|
1 |
| -#Support "no new privileges" in Kubernetes |
| 1 | +# No New Privileges |
2 | 2 |
|
3 |
| -##Description |
| 3 | +- [Description](#description) |
| 4 | + * [Interactions with other Linux primitives](#interactions-with-other-linux-primitives) |
| 5 | +- [Current Implementations](#current-implementations) |
| 6 | + * [Support in Docker](#support-in-docker) |
| 7 | + * [Support in rkt](#support-in-rkt) |
| 8 | + * [Support in OCI runtimes](#support-in-oci-runtimes) |
| 9 | +- [Existing SecurityContext objects](#existing-securitycontext-objects) |
| 10 | +- [Changes of SecurityContext objects](#changes-of-securitycontext-objects) |
| 11 | +- [Default via Pod Security Policy](#default-via-pod-security-policy) |
4 | 12 |
|
5 |
| -In Linux, the `execve` system call can grant more privileges to a newly-created process than its parent process. Considering security issues, since Linux kernel v3.5, there is a new flag named `no_new_privs` added to prevent those new privileges from being granted to the processes. |
6 | 13 |
|
7 |
| -`no_new_privs` is inherited across `fork`, `clone` and `execve` and can not be unset. With `no_new_privs` set, `execve` promises not to grant the privilege to do anything that could not have been done without the `execve` call. |
| 14 | +## Description |
8 | 15 |
|
9 |
| -For more details about `no_new_privs`, please check the Linux kernel document [here](https://www.kernel.org/doc/Documentation/prctl/no_new_privs.txt). |
| 16 | +In Linux, the `execve` system call can grant a newly started program more |
| 17 | +privileges than its parent process had. To address the security risk this |
| 18 | +poses, Linux kernel v3.5 added a flag named `no_new_privs` that prevents such |
| 19 | +new privileges from being granted. |
10 | 20 |
|
11 |
| -Docker started to support `no_new_privs` option since 1.11. Here is the [link](https://github.com/docker/docker/issues/20329) of the ticket in Docker community to support `no_new_privs` option. |
| 21 | +[`no_new_privs`](https://www.kernel.org/doc/Documentation/prctl/no_new_privs.txt) |
| 22 | +is inherited across `fork`, `clone` and `execve` and can not be unset. With |
| 23 | +`no_new_privs` set, `execve` promises not to grant the privilege to do anything |
| 24 | +that could not have been done without the `execve` call. |
12 | 25 |
|
13 |
| -We want to support the creation of containers with `no_new_privs` enabled in Kubernetes, which will make the Kubernetes cluster more safe. Here is the [link](https://github.com/kubernetes/kubernetes/issues/38417) of the ticket in Kubernetes community to track this proposal. |
| 26 | +For more details about `no_new_privs`, please check the |
| 27 | +[Linux kernel documentation](https://www.kernel.org/doc/Documentation/prctl/no_new_privs.txt). |
14 | 28 |
|
| 29 | +This is different from `NOSUID` in that `no_new_privs` can give permission to |
| 30 | +the container process to further restrict child processes with seccomp. This |
| 31 | +permission goes only one-way in that the container process can not grant more |
| 32 | +permissions, only further restrict. |
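On kernels that expose it, the flag's current value appears as a `NoNewPrivs` line in `/proc/<pid>/status`. The following is a minimal sketch, not part of the proposal, of how a process might check its own flag; the helper name `noNewPrivsFromStatus` is hypothetical.

```go
package main

import (
	"bufio"
	"fmt"
	"os"
	"strings"
)

// noNewPrivsFromStatus (hypothetical helper) scans the text of a
// /proc/<pid>/status file and reports the value of its NoNewPrivs field.
// found is false when the field is absent, as on kernels that predate it.
func noNewPrivsFromStatus(status string) (set bool, found bool) {
	sc := bufio.NewScanner(strings.NewReader(status))
	for sc.Scan() {
		fields := strings.Fields(sc.Text())
		if len(fields) == 2 && fields[0] == "NoNewPrivs:" {
			return fields[1] == "1", true
		}
	}
	return false, false
}

func main() {
	data, err := os.ReadFile("/proc/self/status")
	if err != nil {
		fmt.Println("cannot read /proc/self/status:", err)
		return
	}
	set, found := noNewPrivsFromStatus(string(data))
	fmt.Printf("NoNewPrivs found=%v set=%v\n", found, set)
}
```

Because the flag is inherited and cannot be unset, a value of `1` seen here also holds for every descendant of the process.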
15 | 33 |
|
16 |
| -##Current implementation |
| 34 | +### Interactions with other Linux primitives |
17 | 35 |
|
18 |
| -###Support in Docker |
| 36 | +- suid binaries: will break when `no_new_privs` is enabled |
| 37 | +- seccomp2 as a non root user: requires `no_new_privs` |
| 38 | +- seccomp2 with dropped `CAP_SYS_ADMIN`: requires `no_new_privs` |
| 39 | +- ambient capabilities: requires `no_new_privs` |
| 40 | +- selinux transitions: had bugs that have since been fixed, as documented [here](https://github.com/moby/moby/issues/23981#issuecomment-233121969) |
19 | 41 |
|
20 |
| -Since Docker 1.11, user can specify `--security-opt` to enable `no_new_privs` while creating containers, e.g. `docker run --security-opt=no-new-privileges busybox` |
21 | 42 |
|
22 |
| -For program client, Docker provides an object named `ContainerCreateConfig` defined in package `github.com/docker/engine-api/types` to config container creation parameters. In this object, there is a string array `HostConfig.SecurityOpt` to specify the security options. Client can utilize this field to specify the arguments for security options while creating new containers. |
| 43 | +## Current Implementations |
23 | 44 |
|
24 |
| -###Support in OCI runtimes |
| 45 | +### Support in Docker |
25 | 46 |
|
26 |
| -Since version 0.3.0 of the OCI runtime specification, a user can specify the `noNewPrivs` boolean flag in the configuration file. |
| 47 | +Since Docker 1.11, a user can specify `--security-opt` to enable `no_new_privs` |
| 48 | +while creating containers, for example |
| 49 | +`docker run --security-opt=no-new-privileges busybox`. |
27 | 50 |
|
28 |
| -More details of OCI implementation can be checked [here](https://github.com/opencontainers/runtime-spec/pull/290). |
| 51 | +Docker's Go API provides an object named `ContainerCreateConfig` to |
| 52 | +configure container creation parameters. This object has a string array, |
| 53 | +`HostConfig.SecurityOpt`, that specifies the security options. Clients can |
| 54 | +use this field to pass the arguments for security options when creating |
| 55 | +new containers. |
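As a rough illustration of the string-array approach described above, the sketch below appends Docker's `no-new-privileges` option to a list of security options. The `hostConfig` type is a hypothetical stand-in that models only the `SecurityOpt` field, not Docker's actual Go type, and `withNoNewPrivileges` is an assumed helper name.

```go
package main

import "fmt"

// hostConfig is a hypothetical stand-in for the part of Docker's host
// configuration discussed above; only SecurityOpt is modelled.
type hostConfig struct {
	SecurityOpt []string
}

// withNoNewPrivileges (assumed helper) returns a copy of cfg with the
// no-new-privileges option appended, unless it is already present.
func withNoNewPrivileges(cfg hostConfig) hostConfig {
	const opt = "no-new-privileges"
	out := hostConfig{SecurityOpt: append([]string(nil), cfg.SecurityOpt...)}
	for _, o := range out.SecurityOpt {
		if o == opt {
			return out
		}
	}
	out.SecurityOpt = append(out.SecurityOpt, opt)
	return out
}

func main() {
	cfg := withNoNewPrivileges(hostConfig{SecurityOpt: []string{"label=disable"}})
	fmt.Println(cfg.SecurityOpt)
}
```

Every option in this design is an untyped string that the daemon must parse, so a typo in an option name cannot be caught at compile time.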
29 | 56 |
|
30 |
| -###SecurityContext in Kubernetes |
| 57 | +This field did not scale well for the Docker client, so it's suggested that |
| 58 | +Kubernetes does not follow that design. |
31 | 59 |
|
32 |
| -Kubernetes defines `SecurityContext` for `Container` and `PodSecurityContext` for `PodSpec`. `SecurityContext` objects define the related security options for Kubernetes containers, e.g. selinux options. |
| 60 | +This is not on by default in Docker. |
33 | 61 |
|
34 |
| -While creating a container, kubelet parses the security context object and formats the security option strings for Docker. The security options strings will finally be inserted into `ContainerCreateConfig.HostConfig.SecurityOpt` and passed to Docker. Different Kubernetes runtimes now are using different methods to parse and format the security option strings: |
35 |
| -* method `#getSecurityOpts` in `docker_mager_xxxx.go` for Docker runtime |
36 |
| -* method `#getContainerSecurityOpts` in `docker_container.go` for CRI |
| 62 | +More details of the Docker implementation can be read |
| 63 | +[here](https://github.com/moby/moby/pull/20727) as well as the original |
| 64 | +discussion [here](https://github.com/moby/moby/issues/20329). |
37 | 65 |
|
| 66 | +### Support in rkt |
38 | 67 |
|
39 |
| -##Proposal to support "no new privileges" |
| 68 | +Since rkt v1.26.0, the `NoNewPrivileges` option has been enabled in rkt. |
40 | 69 |
|
41 |
| -To support "no new privileges" options in Kubernetes, it is proposed to make the following changes: |
| 70 | +More details of the rkt implementation can be read |
| 71 | +[here](https://github.com/rkt/rkt/pull/2677). |
42 | 72 |
|
43 |
| -###Changes of SecurityContext objects |
| 73 | +### Support in OCI runtimes |
44 | 74 |
|
45 |
| -Add a new bool type field named `noNewPrivileges` to both `SecurityContext` definition and `PodSecurityContext` definition: |
46 |
| -* `noNewPrivileges=true` in `PodSecurityContext` means that all the containers in the pod should be run with `no-new-privileges` enabled. This should be a pod level control of `no-new-privileges` flag. |
47 |
| -* `noNewPrivileges` in `SecurityContext` is a container level control of `no-new-privileges` flag, and can override the pod level `noNewPrivileges` setting. |
| 75 | +Since version 0.3.0 of the OCI runtime specification, a user can specify the |
| 76 | +`noNewPrivileges` boolean flag in the configuration file. |
48 | 77 |
|
49 |
| -By default, `noNewPrivileges` is `false`. |
| 78 | +More details of the OCI implementation can be read |
| 79 | +[here](https://github.com/opencontainers/runtime-spec/pull/290). |
50 | 80 |
|
51 |
| -The change of security context API objects requires the update of corresponding Kubernetes documents, need to submit another PR to track this. |
| 81 | +## Existing SecurityContext objects |
52 | 82 |
|
53 |
| -###Changes of docker runtime |
| 83 | +Kubernetes defines `SecurityContext` for `Container` and `PodSecurityContext` |
| 84 | +for `PodSpec`. `SecurityContext` objects define the related security options |
| 85 | +for Kubernetes containers, e.g. selinux options. |
54 | 86 |
|
55 |
| -When parsing the new `SecurityContext` object, kubelet has to take care of `noNewPrivileges` field from security context objects. Once `noNewPrivileges` is `true`, kubelet needs to change `#getSecurityOpts` method in `docker_manager_xxx.go` to add `no-new-privileges` option to `ContainerCreateConfig.HostConfig.SecurityOpt` |
| 87 | +To support "no new privileges" options in Kubernetes, it is proposed to make |
| 88 | +the following changes: |
56 | 89 |
|
57 |
| -###Changes of CRI runtime |
| 90 | +## Changes of SecurityContext objects |
58 | 91 |
|
59 |
| -When parsing the new `SecurityContext` object, kubelet has to take care of `noNewPrivileges` field from security context objects. Once `noNewPrivileges` is `true`, kubelet needs to change `#getContainerSecurityOpts` method in `docker_container.go` to add `no-new-privileges` option to `ContainerCreateConfig.HostConfig.SecurityOpt` |
| 92 | +Add a new `*bool` type field named `allowPrivilegeEscalation` to the `SecurityContext` |
| 93 | +definition. |
60 | 94 |
|
61 |
| -###Changes of kubectl |
| 95 | +By default, `allowPrivilegeEscalation` will be `nil` at the pod security policy |
| 96 | +level with the following exceptions: |
62 | 97 |
|
63 |
| -This is an additional proposal for kubectl. To improve kubectl user experience, we can add a new flag for kubectl command named `--security-opt`. This flag allows user to create pod with security options configured when using `kubectl run` command. For example, if user issues command like `kubectl run busybox --image=busybox --security-opt=no-new-privileges -- top`, kubernetes shall create a pod with `noNewPrivileges` enabled. |
| 98 | +- when a container is `privileged` |
| 99 | +- when `CAP_SYS_ADMIN` is added to a container |
| 100 | +- when a container is not run as root, uid `0` (to prevent breaking suid |
| 101 | + binaries) |
64 | 102 |
|
65 |
| -If the proposal of kubectl changes is accepted, the patch can also be submitted as a separate PR. |
| 103 | +The API will reject as invalid `privileged=true` and |
| 104 | +`allowPrivilegeEscalation=false`, as well as `capAdd=CAP_SYS_ADMIN` and |
| 105 | +`allowPrivilegeEscalation=false`. |
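The two rejected combinations can be sketched as a validation function. This is only an illustration under assumed names: the `securityContext` struct and `validateSecurityContext` are hypothetical and much smaller than the real Kubernetes API types.

```go
package main

import (
	"errors"
	"fmt"
)

// securityContext is a hypothetical, trimmed-down model of the fields that
// matter for this rule; it is not the real Kubernetes type.
type securityContext struct {
	Privileged               *bool
	AddedCapabilities        []string
	AllowPrivilegeEscalation *bool
}

// validateSecurityContext rejects the two combinations described above:
// allowPrivilegeEscalation=false together with privileged=true, or together
// with an added CAP_SYS_ADMIN capability.
func validateSecurityContext(sc securityContext) error {
	if sc.AllowPrivilegeEscalation == nil || *sc.AllowPrivilegeEscalation {
		return nil // nothing to conflict with
	}
	if sc.Privileged != nil && *sc.Privileged {
		return errors.New("privileged=true conflicts with allowPrivilegeEscalation=false")
	}
	for _, c := range sc.AddedCapabilities {
		if c == "CAP_SYS_ADMIN" {
			return errors.New("capAdd=CAP_SYS_ADMIN conflicts with allowPrivilegeEscalation=false")
		}
	}
	return nil
}

func main() {
	yes, no := true, false
	err := validateSecurityContext(securityContext{Privileged: &yes, AllowPrivilegeEscalation: &no})
	fmt.Println(err)
}
```

Leaving `AllowPrivilegeEscalation` as a pointer lets the check distinguish "unset" from an explicit `false`, which is why the proposal uses `*bool`.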
| 106 | + |
| 107 | +When `allowPrivilegeEscalation` is set to false it will enable `no_new_privs` |
| 108 | +for that container. |
| 109 | + |
| 110 | +`allowPrivilegeEscalation` in `SecurityContext` provides container-level |
| 111 | +control of the `no_new_privs` flag and can override the default |
| 112 | +`allowPrivilegeEscalation` setting in either direction. |
| 113 | + |
| 114 | +This requires changes to the Docker, rkt, and CRI runtime integrations so that |
| 115 | +kubelet will add the specific `no_new_privs` option. |
| 116 | + |
| 117 | +## Default via Pod Security Policy |
| 118 | + |
| 119 | +The default can be set via a new `*bool` type field named `allowPrivilegeEscalation` |
| 120 | +in a Pod Security Policy. |
| 121 | +This would allow users to set `allowPrivilegeEscalation=false`, overriding the |
| 122 | +default behavior of `no_new_privs=false` for containers |
| 123 | +whose uids are not 0. |
| 124 | + |
| 125 | +This would also keep the behavior of setting `allowPrivilegeEscalation=true` |
| 126 | +for privileged containers and those with `capAdd=CAP_SYS_ADMIN`. |
| 127 | + |
| 128 | +To recap, below is a table defining the default behavior at the pod security |
| 129 | +policy level and what can be set as a default with a pod security policy. |
| 130 | + |
| 131 | +| allowPrivilegeEscalation setting | uid = 0 or unset | uid != 0 | privileged/CAP_SYS_ADMIN | |
| 132 | +|----------------------------------|--------------------|--------------------|--------------------------| |
| 133 | +| nil | no_new_privs=true | no_new_privs=false | no_new_privs=false | |
| 134 | +| false | no_new_privs=true | no_new_privs=true | no_new_privs=false | |
| 135 | +| true | no_new_privs=false | no_new_privs=false | no_new_privs=false | |
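The table can be read as a small decision function. The sketch below is one possible encoding, with hypothetical names. It assumes the privileged/`CAP_SYS_ADMIN` column takes precedence over the uid columns when both apply, which the table does not state explicitly.

```go
package main

import "fmt"

// noNewPrivs (hypothetical helper) mirrors the table above.
// allow is the allowPrivilegeEscalation setting (nil when unset), uid is the
// container's run-as uid (nil when unset), and privilegedOrSysAdmin is true
// for privileged containers or those that add CAP_SYS_ADMIN.
// Assumption: the privileged/CAP_SYS_ADMIN column wins over the uid columns.
func noNewPrivs(allow *bool, uid *int64, privilegedOrSysAdmin bool) bool {
	if privilegedOrSysAdmin {
		return false // last column: never set no_new_privs
	}
	if allow != nil {
		return !*allow // explicit setting: false enables, true disables
	}
	return uid == nil || *uid == 0 // nil row: depends on the uid column
}

func main() {
	no := false
	uid := int64(1000)
	fmt.Println(noNewPrivs(nil, nil, false))  // nil row, uid unset
	fmt.Println(noNewPrivs(nil, &uid, false)) // nil row, uid != 0
	fmt.Println(noNewPrivs(&no, &uid, false)) // false row, uid != 0
	fmt.Println(noNewPrivs(&no, &uid, true))  // false row, privileged
}
```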