Commit 3d96e91

update no new privs proposal
Signed-off-by: Jess Frazelle <[email protected]>
1 parent 3bd5f9f commit 3d96e91

1 file changed

+104
-34
lines changed

@@ -1,65 +1,135 @@
1-
#Support "no new privileges" in Kubernetes
1+
# No New Privileges
22

3-
##Description
3+
- [Description](#description)
4+
* [Interactions with other Linux primitives](#interactions-with-other-linux-primitives)
5+
- [Current Implementations](#current-implementations)
6+
* [Support in Docker](#support-in-docker)
7+
* [Support in rkt](#support-in-rkt)
8+
* [Support in OCI runtimes](#support-in-oci-runtimes)
9+
- [Existing SecurityContext objects](#existing-securitycontext-objects)
10+
- [Changes of SecurityContext objects](#changes-of-securitycontext-objects)
11+
- [Default via Pod Security Policy](#default-via-pod-security-policy)
412

5-
In Linux, the `execve` system call can grant more privileges to a newly-created process than its parent process. Considering security issues, since Linux kernel v3.5, there is a new flag named `no_new_privs` added to prevent those new privileges from being granted to the processes.
613

7-
`no_new_privs` is inherited across `fork`, `clone` and `execve` and can not be unset. With `no_new_privs` set, `execve` promises not to grant the privilege to do anything that could not have been done without the `execve` call.
14+
## Description
815

9-
For more details about `no_new_privs`, please check the Linux kernel document [here](https://www.kernel.org/doc/Documentation/prctl/no_new_privs.txt).
16+
In Linux, the `execve` system call can grant more privileges to a newly-created
17+
process than its parent process. Considering security issues, since Linux kernel
18+
v3.5, there is a new flag named `no_new_privs` added to prevent those new
19+
privileges from being granted to the processes.
1020

11-
Docker started to support `no_new_privs` option since 1.11. Here is the [link](https://github.com/docker/docker/issues/20329) of the ticket in Docker community to support `no_new_privs` option.
21+
[`no_new_privs`](https://www.kernel.org/doc/Documentation/prctl/no_new_privs.txt)
22+
is inherited across `fork`, `clone` and `execve` and can not be unset. With
23+
`no_new_privs` set, `execve` promises not to grant the privilege to do anything
24+
that could not have been done without the `execve` call.
1225

13-
We want to support the creation of containers with `no_new_privs` enabled in Kubernetes, which will make the Kubernetes cluster more safe. Here is the [link](https://github.com/kubernetes/kubernetes/issues/38417) of the ticket in Kubernetes community to track this proposal.
26+
For more details about `no_new_privs`, please check the
27+
[Linux kernel documentation](https://www.kernel.org/doc/Documentation/prctl/no_new_privs.txt).
1428

29+
This is different from `NOSUID` in that `no_new_privs` can give permission to
30+
the container process to further restrict child processes with seccomp. This
31+
permission goes only one-way in that the container process can not grant more
32+
permissions, only further restrict.
1533

16-
##Current implementation
34+
### Interactions with other Linux primitives
1735

18-
###Support in Docker
36+
- suid binaries: will break when `no_new_privs` is enabled
37+
- seccomp2 as a non root user: requires `no_new_privs`
38+
- seccomp2 with dropped `CAP_SYS_ADMIN`: requires `no_new_privs`
39+
- ambient capabilities: requires `no_new_privs`
40+
- SELinux transitions: bugs that have since been fixed are documented [here](https://github.com/moby/moby/issues/23981#issuecomment-233121969)
1941

20-
Since Docker 1.11, user can specify `--security-opt` to enable `no_new_privs` while creating containers, e.g. `docker run --security-opt=no-new-privileges busybox`
2142

22-
For program client, Docker provides an object named `ContainerCreateConfig` defined in package `github.com/docker/engine-api/types` to config container creation parameters. In this object, there is a string array `HostConfig.SecurityOpt` to specify the security options. Client can utilize this field to specify the arguments for security options while creating new containers.
43+
## Current Implementations
2344

24-
###Support in OCI runtimes
45+
### Support in Docker
2546

26-
Since version 0.3.0 of the OCI runtime specification, a user can specify the `noNewPrivs` boolean flag in the configuration file.
47+
Since Docker 1.11, a user can specify `--security-opt` to enable `no_new_privs`
48+
while creating containers, for example
49+
`docker run --security-opt=no-new-privileges busybox`.
2750

28-
More details of OCI implementation can be checked [here](https://github.com/opencontainers/runtime-spec/pull/290).
51+
Docker's Go API provides an object named `ContainerCreateConfig` to
52+
configure container creation parameters. In this object, there is a string
53+
array `HostConfig.SecurityOpt` to specify the security options. Clients can
54+
utilize this field to specify the arguments for security options while
55+
creating new containers.
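
To make the string-array design concrete, here is a small Go sketch of how a client might add the option to such a list. The `hostConfig` type is a hypothetical stand-in that models only the `SecurityOpt` field named above, not the real Docker API types, and `withNoNewPrivileges` is an illustrative helper. The option string `no-new-privileges` is the one accepted by `docker run --security-opt`.

```go
package main

import "fmt"

// hostConfig is a hypothetical stand-in for Docker's HostConfig; only the
// SecurityOpt string array is modeled here.
type hostConfig struct {
	SecurityOpt []string
}

// noNewPrivilegesOpt is the option string accepted by --security-opt.
const noNewPrivilegesOpt = "no-new-privileges"

// withNoNewPrivileges (hypothetical helper) returns a copy of opts that
// contains the no-new-privileges option exactly once.
func withNoNewPrivileges(opts []string) []string {
	out := make([]string, 0, len(opts)+1)
	for _, o := range opts {
		if o != noNewPrivilegesOpt {
			out = append(out, o)
		}
	}
	return append(out, noNewPrivilegesOpt)
}

func main() {
	hc := hostConfig{SecurityOpt: []string{"label=disable"}}
	hc.SecurityOpt = withNoNewPrivileges(hc.SecurityOpt)
	fmt.Println(hc.SecurityOpt)
}
```

Every security feature ends up encoded as a free-form string in the same list, which gives a sense of why a typed field is easier to validate.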
2956

30-
###SecurityContext in Kubernetes
57+
This field did not scale well for the Docker client, so it's suggested that
58+
Kubernetes does not follow that design.
3159

32-
Kubernetes defines `SecurityContext` for `Container` and `PodSecurityContext` for `PodSpec`. `SecurityContext` objects define the related security options for Kubernetes containers, e.g. selinux options.
60+
This is not on by default in Docker.
3361

34-
While creating a container, kubelet parses the security context object and formats the security option strings for Docker. The security options strings will finally be inserted into `ContainerCreateConfig.HostConfig.SecurityOpt` and passed to Docker. Different Kubernetes runtimes now are using different methods to parse and format the security option strings:
35-
* method `#getSecurityOpts` in `docker_manager_xxxx.go` for Docker runtime
36-
* method `#getContainerSecurityOpts` in `docker_container.go` for CRI
62+
More details of the Docker implementation can be read
63+
[here](https://github.com/moby/moby/pull/20727) as well as the original
64+
discussion [here](https://github.com/moby/moby/issues/20329).
3765

66+
### Support in rkt
3867

39-
##Proposal to support "no new privileges"
68+
Since rkt v1.26.0, the `NoNewPrivileges` option has been enabled in rkt.
4069

41-
To support "no new privileges" options in Kubernetes, it is proposed to make the following changes:
70+
More details of the rkt implementation can be read
71+
[here](https://github.com/rkt/rkt/pull/2677).
4272

43-
###Changes of SecurityContext objects
73+
### Support in OCI runtimes
4474

45-
Add a new bool type field named `noNewPrivileges` to both `SecurityContext` definition and `PodSecurityContext` definition:
46-
* `noNewPrivileges=true` in `PodSecurityContext` means that all the containers in the pod should be run with `no-new-privileges` enabled. This should be a pod level control of `no-new-privileges` flag.
47-
* `noNewPrivileges` in `SecurityContext` is a container level control of `no-new-privileges` flag, and can override the pod level `noNewPrivileges` setting.
75+
Since version 0.3.0 of the OCI runtime specification, a user can specify the
76+
`noNewPrivs` boolean flag in the configuration file.
4877

49-
By default, `noNewPrivileges` is `false`.
78+
More details of the OCI implementation can be read
79+
[here](https://github.com/opencontainers/runtime-spec/pull/290).
5080

51-
The change of security context API objects requires the update of corresponding Kubernetes documents, need to submit another PR to track this.
81+
## Existing SecurityContext objects
5282

53-
###Changes of docker runtime
83+
Kubernetes defines `SecurityContext` for `Container` and `PodSecurityContext`
84+
for `PodSpec`. `SecurityContext` objects define the related security options
85+
for Kubernetes containers, e.g. SELinux options.
5486

55-
When parsing the new `SecurityContext` object, kubelet has to take care of `noNewPrivileges` field from security context objects. Once `noNewPrivileges` is `true`, kubelet needs to change `#getSecurityOpts` method in `docker_manager_xxx.go` to add `no-new-privileges` option to `ContainerCreateConfig.HostConfig.SecurityOpt`
87+
To support "no new privileges" options in Kubernetes, it is proposed to make
88+
the following changes:
5689

57-
###Changes of CRI runtime
90+
## Changes of SecurityContext objects
5891

59-
When parsing the new `SecurityContext` object, kubelet has to take care of `noNewPrivileges` field from security context objects. Once `noNewPrivileges` is `true`, kubelet needs to change `#getContainerSecurityOpts` method in `docker_container.go` to add `no-new-privileges` option to `ContainerCreateConfig.HostConfig.SecurityOpt`
92+
Add a new `*bool` type field named `allowPrivilegeEscalation` to the `SecurityContext`
93+
definition.
6094

61-
###Changes of kubectl
95+
By default, `allowPrivilegeEscalation` will be `nil` at the pod security policy
96+
level with the following exceptions:
6297

63-
This is an additional proposal for kubectl. To improve kubectl user experience, we can add a new flag for kubectl command named `--security-opt`. This flag allows user to create pod with security options configured when using `kubectl run` command. For example, if user issues command like `kubectl run busybox --image=busybox --security-opt=no-new-privileges -- top`, kubernetes shall create a pod with `noNewPrivileges` enabled.
98+
- when a container is `privileged`
99+
- when `CAP_SYS_ADMIN` is added to a container
100+
- when a container is not run as root, uid `0` (to prevent breaking suid
101+
binaries)
64102

65-
If the proposal of kubectl changes is accepted, the patch can also be submitted as a separate PR.
103+
The API will reject as invalid `privileged=true` and
104+
`allowPrivilegeEscalation=false`, as well as `capAdd=CAP_SYS_ADMIN` and
105+
`allowPrivilegeEscalation=false`.
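
The rejection rule above could be sketched in Go roughly as follows. This is an assumption-based illustration, not the actual Kubernetes validation code: the `securityContext` struct, its field names, and the `validate` function are hypothetical, and the capability is compared using the `CAP_SYS_ADMIN` spelling used in this proposal.

```go
package main

import (
	"errors"
	"fmt"
)

// securityContext is a hypothetical, trimmed-down model of the fields that
// matter for this rule.
type securityContext struct {
	Privileged               *bool
	AddCapabilities          []string
	AllowPrivilegeEscalation *bool
}

func boolPtr(b bool) *bool { return &b }

// validate (hypothetical) rejects allowPrivilegeEscalation=false when the
// container is privileged or adds CAP_SYS_ADMIN.
func validate(sc securityContext) error {
	// Only an explicit false can conflict; nil and true are always accepted.
	if sc.AllowPrivilegeEscalation == nil || *sc.AllowPrivilegeEscalation {
		return nil
	}
	if sc.Privileged != nil && *sc.Privileged {
		return errors.New("privileged=true conflicts with allowPrivilegeEscalation=false")
	}
	for _, c := range sc.AddCapabilities {
		if c == "CAP_SYS_ADMIN" {
			return errors.New("capAdd=CAP_SYS_ADMIN conflicts with allowPrivilegeEscalation=false")
		}
	}
	return nil
}

func main() {
	sc := securityContext{
		Privileged:               boolPtr(true),
		AllowPrivilegeEscalation: boolPtr(false),
	}
	fmt.Println(validate(sc))
}
```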
106+
107+
When `allowPrivilegeEscalation` is set to false it will enable `no_new_privs`
108+
for that container.
109+
110+
`allowPrivilegeEscalation` in `SecurityContext` provides container level
111+
control of the `no_new_privs` flag and can override the default
112+
`allowPrivilegeEscalation` setting in either direction.
113+
114+
This requires changes to the Docker, rkt, and CRI runtime integrations so that
115+
kubelet will add the specific `no_new_privs` option.
116+
117+
## Default via Pod Security Policy
118+
119+
The default can be set via a new `*bool` type field named `allowPrivilegeEscalation`
120+
in a Pod Security Policy.
121+
This would allow users to set `allowPrivilegeEscalation=false`, overriding the
122+
default behavior of `no_new_privs=false` for containers
123+
whose uids are not 0.
124+
125+
This would also keep the behavior of setting `allowPrivilegeEscalation=true`
126+
for privileged containers and those with `capAdd=CAP_SYS_ADMIN`.
127+
128+
To recap, below is a table defining the default behavior at the pod security
129+
policy level and what can be set as a default with a pod security policy.
130+
131+
| allowPrivilegeEscalation setting | uid = 0 or unset   | uid != 0           | privileged/CAP_SYS_ADMIN |
|----------------------------------|--------------------|--------------------|--------------------------|
| nil                              | no_new_privs=true  | no_new_privs=false | no_new_privs=false       |
| false                            | no_new_privs=true  | no_new_privs=true  | no_new_privs=false       |
| true                             | no_new_privs=false | no_new_privs=false | no_new_privs=false       |
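
The table can be read as a small decision function. The Go sketch below is one way to encode it, under the assumption that the privileged/`CAP_SYS_ADMIN` column takes precedence over the other two. The function name `noNewPrivs` and its parameters are hypothetical and are not part of any Kubernetes API.

```go
package main

import "fmt"

func boolPtr(b bool) *bool    { return &b }
func int64Ptr(i int64) *int64 { return &i }

// noNewPrivs (hypothetical) returns the no_new_privs value the table above
// assigns to a container.
//   - allowPrivilegeEscalation: nil, false, or true (the table rows)
//   - uid: nil when unset
//   - privileged, capSysAdmin: the last table column
func noNewPrivs(allowPrivilegeEscalation *bool, uid *int64, privileged, capSysAdmin bool) bool {
	// Last column: privileged or CAP_SYS_ADMIN containers never get the flag.
	if privileged || capSysAdmin {
		return false
	}
	// Rows "false" and "true": an explicit setting decides for any uid.
	if allowPrivilegeEscalation != nil {
		return !*allowPrivilegeEscalation
	}
	// Row "nil": the flag is set only for uid 0 or an unset uid.
	return uid == nil || *uid == 0
}

func main() {
	fmt.Println(noNewPrivs(nil, nil, false, false))
	fmt.Println(noNewPrivs(nil, int64Ptr(1000), false, false))
	fmt.Println(noNewPrivs(boolPtr(false), int64Ptr(1000), false, false))
}
```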
