Commit 1eaf775

seccomp: add default profile
Signed-off-by: Jess Frazelle <[email protected]>
1 parent 4fff7fa commit 1eaf775

1 file changed

+237
-21
lines changed


contributors/design-proposals/seccomp.md

@@ -1,3 +1,30 @@
1+
# Seccomp
2+
3+
- [Abstract](#abstract)
4+
- [Motivation](#motivation)
5+
- [Constraints and Assumptions](#constraints-and-assumptions)
6+
- [Use Cases](#use-cases)
7+
* [Use Case: Administrator Access Control](#use-case--administrator-access-control)
8+
* [Use Case: Seccomp profiles similar to container runtime defaults](#use-case--seccomp-profiles-similar-to-container-runtime-defaults)
9+
* [Use Case: Applications that link to libseccomp](#use-case--applications-that-link-to-libseccomp)
10+
* [Use Case: Custom profiles](#use-case--custom-profiles)
11+
- [Community Work](#community-work)
12+
* [Docker / OCI](#docker---oci)
13+
* [rkt / appcontainers](#rkt---appcontainers)
14+
* [HyperContainer](#hypercontainer)
15+
* [lxd](#lxd)
16+
* [Other platforms and seccomp-like capabilities](#other-platforms-and-seccomp-like-capabilities)
17+
- [Proposed Design](#proposed-design)
18+
* [Seccomp API Resource?](#seccomp-api-resource-)
19+
* [Pod Security Policy annotation](#pod-security-policy-annotation)
20+
* [Spec](#spec)
21+
* [Default Profile](#default-profile)
22+
+ [Various Syscalls Not Allowed](#various-syscalls-not-allowed)
23+
+ [Default Behavior](#default-behavior)
24+
- [Examples](#examples)
25+
* [Unconfined profile](#unconfined-profile)
26+
* [Custom profile](#custom-profile)
27+
128
## Abstract
229

330
A proposal for adding **alpha** support for
@@ -28,6 +55,7 @@ This design should:
2855
* be container-runtime agnostic
2956
* allow use of custom profiles
3057
* facilitate containerized applications that link directly to libseccomp
58+
* enable a default seccomp profile for containers
3159

3260
## Use Cases
3361

@@ -40,14 +68,16 @@ This design should:
4068
unmediated by Kubernetes
4169
4. As a user, I want to be able to use a custom seccomp profile and use
4270
it with my containers
71+
5. As a user and administrator, I want Kubernetes to apply a sane default
72+
seccomp profile to containers unless I specify otherwise.
4373

44-
### Use Case: Administrator access control
74+
### Use Case: Administrator Access Control
4575

4676
Controlling access to seccomp profiles is a cluster administrator
4777
concern. It should be possible for an administrator to control which users
4878
have access to which profiles.
4979

50-
The [pod security policy](https://github.com/kubernetes/kubernetes/pull/7893)
80+
The [Pod Security Policy](https://github.com/kubernetes/kubernetes/pull/7893)
5181
API extension governs the ability of users to make requests that affect pod
5282
and container security contexts. The proposed design should deal with
5383
required changes to control access to new functionality.
@@ -101,9 +131,7 @@ implement a sandbox for user-provided code, such as
101131

102132
## Community Work
103133

104-
### Container runtime support for seccomp
105-
106-
#### Docker / opencontainers
134+
### Docker / OCI
107135

108136
Docker supports the open container initiative's API for
109137
seccomp, which is very close to the libseccomp API. It allows full
@@ -112,14 +140,21 @@ specification of seccomp filters, with arguments, operators, and actions.
112140
Docker allows the specification of a single seccomp filter. There are
113141
community requests for:
114142

115-
Issues:
116-
117143
* [docker/22109](https://github.com/docker/docker/issues/22109): composable
118144
seccomp filters
119145
* [docker/21105](https://github.com/docker/docker/issues/22105): custom
120146
seccomp filters for builds
121147

122-
#### rkt / appcontainers
148+
Implementation details:
149+
150+
* [docker/17989](https://github.com/moby/moby/pull/17989): initial
151+
implementation
152+
* [docker/18780](https://github.com/moby/moby/pull/18780): default blacklist
153+
profile
154+
* [docker/18979](https://github.com/moby/moby/pull/18979): default whitelist
155+
profile
156+
157+
### rkt / appcontainers
123158

124159
The `rkt` runtime delegates to systemd for seccomp support; there is an open
125160
issue to add support once `appc` supports it. The `appc` project has an open
@@ -133,23 +168,24 @@ Issues:
133168
* [appc/529](https://github.com/appc/spec/issues/529)
134169
* [rkt/1614](https://github.com/coreos/rkt/issues/1614)
135170

136-
#### HyperContainer
171+
### HyperContainer
137172

138173
[HyperContainer](https://hypercontainer.io) does not support seccomp.
139174

140-
### Other platforms and seccomp-like capabilities
141-
142-
FreeBSD has a seccomp/capability-like facility called
143-
[Capsicum](https://www.freebsd.org/cgi/man.cgi?query=capsicum&sektion=4).
144-
145-
#### lxd
175+
### lxd
146176

147177
[`lxd`](http://www.ubuntu.com/cloud/lxd) constrains containers using a default profile.
148178

149179
Issues:
150180

151181
* [lxd/1084](https://github.com/lxc/lxd/issues/1084): add knobs for seccomp
152182

183+
### Other platforms and seccomp-like capabilities
184+
185+
FreeBSD has a seccomp/capability-like facility called
186+
[Capsicum](https://www.freebsd.org/cgi/man.cgi?query=capsicum&sektion=4).
187+
188+
153189
## Proposed Design
154190

155191
### Seccomp API Resource?
@@ -168,8 +204,6 @@ Instead of implementing a new API resource, we propose that pods be able to
168204
reference seccomp profiles by name. Since this is an alpha feature, we will
169205
use annotations instead of extending the API with new fields.
170206

171-
### API changes?
172-
173207
In the alpha version of this feature we will use annotations to store the
174208
names of seccomp profiles. The keys will be:
175209

@@ -214,6 +248,192 @@ profiles using the key
214248
`seccomp.security.alpha.kubernetes.io/allowedProfileNames`. The value of this
215249
key should be a comma delimited list.
216250
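
As a rough illustration of how the comma-delimited value could be consumed, the sketch below parses the annotation and checks a requested profile against it. The helper names, the whitespace trimming, and the exact-match comparison are assumptions of this sketch, and the profile names are placeholders; none of them are specified by the proposal.

```go
package main

import (
	"fmt"
	"strings"
)

// AllowedProfileNamesKey is the annotation key named in the text above.
const AllowedProfileNamesKey = "seccomp.security.alpha.kubernetes.io/allowedProfileNames"

// parseAllowedProfiles splits the comma-delimited annotation value into
// profile names. Trimming whitespace and dropping empty entries is an
// assumption of this sketch.
func parseAllowedProfiles(value string) []string {
	var names []string
	for _, part := range strings.Split(value, ",") {
		if name := strings.TrimSpace(part); name != "" {
			names = append(names, name)
		}
	}
	return names
}

// isProfileAllowed reports whether profile is listed in the annotation.
// Exact string matching is assumed here.
func isProfileAllowed(annotations map[string]string, profile string) bool {
	for _, name := range parseAllowedProfiles(annotations[AllowedProfileNamesKey]) {
		if name == profile {
			return true
		}
	}
	return false
}

func main() {
	// Hypothetical policy annotations with placeholder profile names.
	policy := map[string]string{
		AllowedProfileNamesKey: "profile-a, profile-b",
	}
	fmt.Println(isProfileAllowed(policy, "profile-b"))
	fmt.Println(isProfileAllowed(policy, "profile-c"))
}
```

A policy with no annotation allows nothing in this sketch; whether that should be the behavior for a missing key is left open here.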

251+
### Spec
252+
253+
We will start from the OCI specification. This API resource will be added to
254+
`settings.k8s.io` as an `alpha` resource.
255+
256+
```
257+
// Seccomp represents syscall restrictions
258+
type Seccomp struct {
259+
unversioned.TypeMeta
260+
ObjectMeta
261+
262+
// +optional
263+
Spec SeccompSpec
264+
}
265+
266+
// SeccompSpec represents the spec for syscall restrictions
267+
type SeccompSpec struct {
268+
DefaultAction SeccompAction `json:"defaultAction"`
269+
Architectures []Arch `json:"architectures,omitempty"`
270+
Syscalls []Syscall `json:"syscalls,omitempty"`
271+
}
272+
273+
// Arch used for additional architectures
274+
type Arch string
275+
276+
// Additional architectures permitted to be used for system calls
277+
// By default only the native architecture of the kernel is permitted
278+
const (
279+
ArchX86 Arch = "SCMP_ARCH_X86"
280+
ArchX86_64 Arch = "SCMP_ARCH_X86_64"
281+
ArchX32 Arch = "SCMP_ARCH_X32"
282+
ArchARM Arch = "SCMP_ARCH_ARM"
283+
ArchAARCH64 Arch = "SCMP_ARCH_AARCH64"
284+
ArchMIPS Arch = "SCMP_ARCH_MIPS"
285+
ArchMIPS64 Arch = "SCMP_ARCH_MIPS64"
286+
ArchMIPS64N32 Arch = "SCMP_ARCH_MIPS64N32"
287+
ArchMIPSEL Arch = "SCMP_ARCH_MIPSEL"
288+
ArchMIPSEL64 Arch = "SCMP_ARCH_MIPSEL64"
289+
ArchMIPSEL64N32 Arch = "SCMP_ARCH_MIPSEL64N32"
290+
ArchPPC Arch = "SCMP_ARCH_PPC"
291+
ArchPPC64 Arch = "SCMP_ARCH_PPC64"
292+
ArchPPC64LE Arch = "SCMP_ARCH_PPC64LE"
293+
ArchS390 Arch = "SCMP_ARCH_S390"
294+
ArchS390X Arch = "SCMP_ARCH_S390X"
295+
ArchPARISC Arch = "SCMP_ARCH_PARISC"
296+
ArchPARISC64 Arch = "SCMP_ARCH_PARISC64"
297+
)
298+
299+
// SeccompAction taken upon Seccomp rule match
300+
type SeccompAction string
301+
302+
// Define actions for Seccomp rules
303+
const (
304+
ActKill SeccompAction = "SCMP_ACT_KILL"
305+
ActTrap SeccompAction = "SCMP_ACT_TRAP"
306+
ActErrno SeccompAction = "SCMP_ACT_ERRNO"
307+
ActTrace SeccompAction = "SCMP_ACT_TRACE"
308+
ActAllow SeccompAction = "SCMP_ACT_ALLOW"
309+
)
310+
311+
// SeccompOperator used to match syscall arguments in Seccomp
312+
type SeccompOperator string
313+
314+
// Define operators for syscall arguments in Seccomp
315+
const (
316+
OpNotEqual SeccompOperator = "SCMP_CMP_NE"
317+
OpLessThan SeccompOperator = "SCMP_CMP_LT"
318+
OpLessEqual SeccompOperator = "SCMP_CMP_LE"
319+
OpEqualTo SeccompOperator = "SCMP_CMP_EQ"
320+
OpGreaterEqual SeccompOperator = "SCMP_CMP_GE"
321+
OpGreaterThan SeccompOperator = "SCMP_CMP_GT"
322+
OpMaskedEqual SeccompOperator = "SCMP_CMP_MASKED_EQ"
323+
)
324+
325+
// SeccompArg used for matching specific syscall arguments in Seccomp
326+
type SeccompArg struct {
327+
Index uint `json:"index"`
328+
Value uint64 `json:"value"`
329+
ValueTwo uint64 `json:"valueTwo"`
330+
Op SeccompOperator `json:"op"`
331+
}
332+
333+
// Syscall is used to match a syscall in Seccomp
334+
type Syscall struct {
335+
Names []string `json:"names"`
336+
Action SeccompAction `json:"action"`
337+
Args []SeccompArg `json:"args,omitempty"`
338+
}
339+
```
340+
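
To make the shape of the spec concrete, here is a minimal sketch that builds a small whitelist-style `SeccompSpec` and serializes it with the JSON tags shown above. It copies only the types it needs, types `DefaultAction` as `SeccompAction` (an assumption, since that is the action type the spec defines), and uses an illustrative syscall list; `exampleSpec`, `encodeSpec`, and `decodeSpec` are hypothetical helpers.

```go
package main

import (
	"encoding/json"
	"fmt"
)

type (
	SeccompAction   string
	SeccompOperator string
	Arch            string
)

const (
	ActErrno   SeccompAction   = "SCMP_ACT_ERRNO"
	ActAllow   SeccompAction   = "SCMP_ACT_ALLOW"
	OpEqualTo  SeccompOperator = "SCMP_CMP_EQ"
	ArchX86_64 Arch            = "SCMP_ARCH_X86_64"
)

type SeccompArg struct {
	Index    uint            `json:"index"`
	Value    uint64          `json:"value"`
	ValueTwo uint64          `json:"valueTwo"`
	Op       SeccompOperator `json:"op"`
}

type Syscall struct {
	Names  []string      `json:"names"`
	Action SeccompAction `json:"action"`
	Args   []SeccompArg  `json:"args,omitempty"`
}

type SeccompSpec struct {
	DefaultAction SeccompAction `json:"defaultAction"`
	Architectures []Arch        `json:"architectures,omitempty"`
	Syscalls      []Syscall     `json:"syscalls,omitempty"`
}

// exampleSpec denies everything by default and allows a few syscalls.
// The syscall selection is illustrative, not the proposed default profile.
func exampleSpec() SeccompSpec {
	return SeccompSpec{
		DefaultAction: ActErrno,
		Architectures: []Arch{ArchX86_64},
		Syscalls: []Syscall{
			{Names: []string{"read", "write", "exit_group"}, Action: ActAllow},
			// Allow socket() only when argument 0 (the domain) is AF_UNIX (1 on Linux).
			{
				Names:  []string{"socket"},
				Action: ActAllow,
				Args:   []SeccompArg{{Index: 0, Value: 1, Op: OpEqualTo}},
			},
		},
	}
}

// encodeSpec renders a spec as indented JSON.
func encodeSpec(spec SeccompSpec) (string, error) {
	b, err := json.MarshalIndent(spec, "", "  ")
	if err != nil {
		return "", err
	}
	return string(b), nil
}

// decodeSpec parses JSON produced by encodeSpec.
func decodeSpec(data string) (SeccompSpec, error) {
	var spec SeccompSpec
	err := json.Unmarshal([]byte(data), &spec)
	return spec, err
}

func main() {
	out, err := encodeSpec(exampleSpec())
	if err != nil {
		fmt.Println("encode failed:", err)
		return
	}
	fmt.Println(out)
}
```

Because `valueTwo` has no `omitempty` tag, it is emitted even when zero; the other optional fields are omitted when empty.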
341+
### Default Profile
342+
343+
We will create our own default seccomp profile that uses the above spec
344+
for containers and use the set of syscalls from the docker default profile
345+
as the initial base. Having our own will allow us to control and
346+
restrict different syscalls in the future.
347+
348+
#### Various Syscalls Not Allowed
349+
350+
The table below lists some of the syscalls that our whitelist will not allow,
351+
and why. It is not exhaustive and covers only some of the important ones.
352+
Most of this was taken from the
353+
[original pull request](https://github.com/moby/moby/pull/19059) to Docker
354+
for the default profile.
355+
356+
| Syscall | Description |
357+
|---------------------|---------------------------------------------------------------------------------------------------------------------------------------|
358+
| `acct` | Accounting syscall which could let containers disable their own resource limits or process accounting. Also gated by `CAP_SYS_PACCT`. |
359+
| `add_key` | Prevent containers from using the kernel keyring, which is not namespaced. |
360+
| `adjtimex` | Similar to `clock_settime` and `settimeofday`, time/date is not namespaced. Also gated by `CAP_SYS_TIME`. |
361+
| `bpf` | Deny loading potentially persistent bpf programs into kernel, already gated by `CAP_SYS_ADMIN`. |
362+
| `clock_adjtime` | Time/date is not namespaced. Also gated by `CAP_SYS_TIME`. |
363+
| `clock_settime` | Time/date is not namespaced. Also gated by `CAP_SYS_TIME`. |
364+
| `clone` | Deny cloning new namespaces. Also gated by `CAP_SYS_ADMIN` for CLONE_* flags, except `CLONE_NEWUSER`. |
365+
| `create_module` | Deny manipulation and functions on kernel modules. Obsolete. Also gated by `CAP_SYS_MODULE`. |
366+
| `delete_module` | Deny manipulation and functions on kernel modules. Also gated by `CAP_SYS_MODULE`. |
367+
| `finit_module` | Deny manipulation and functions on kernel modules. Also gated by `CAP_SYS_MODULE`. |
368+
| `get_kernel_syms` | Deny retrieval of exported kernel and module symbols. Obsolete. |
369+
| `get_mempolicy` | Syscall that modifies kernel memory and NUMA settings. Already gated by `CAP_SYS_NICE`. |
370+
| `init_module` | Deny manipulation and functions on kernel modules. Also gated by `CAP_SYS_MODULE`. |
371+
| `ioperm` | Prevent containers from modifying kernel I/O privilege levels. Already gated by `CAP_SYS_RAWIO`. |
372+
| `iopl` | Prevent containers from modifying kernel I/O privilege levels. Already gated by `CAP_SYS_RAWIO`. |
373+
| `kcmp` | Restrict process inspection capabilities, already blocked by dropping `CAP_SYS_PTRACE`. |
374+
| `kexec_file_load` | Sister syscall of `kexec_load` that does the same thing, slightly different arguments. Also gated by `CAP_SYS_BOOT`. |
375+
| `kexec_load` | Deny loading a new kernel for later execution. Also gated by `CAP_SYS_BOOT`. |
376+
| `keyctl` | Prevent containers from using the kernel keyring, which is not namespaced. |
377+
| `lookup_dcookie` | Tracing/profiling syscall, which could leak a lot of information on the host. Also gated by `CAP_SYS_ADMIN`. |
378+
| `mbind` | Syscall that modifies kernel memory and NUMA settings. Already gated by `CAP_SYS_NICE`. |
379+
| `mount` | Deny mounting, already gated by `CAP_SYS_ADMIN`. |
380+
| `move_pages` | Syscall that modifies kernel memory and NUMA settings. |
381+
| `name_to_handle_at` | Sister syscall to `open_by_handle_at`. Already gated by `CAP_SYS_NICE`. |
382+
| `nfsservctl` | Deny interaction with the kernel nfs daemon. Obsolete since Linux 3.1. |
383+
| `open_by_handle_at` | Cause of an old container breakout. Also gated by `CAP_DAC_READ_SEARCH`. |
384+
| `perf_event_open` | Tracing/profiling syscall, which could leak a lot of information on the host. |
385+
| `personality` | Prevent container from enabling BSD emulation. Not inherently dangerous, but poorly tested, potential for a lot of kernel vulns. |
386+
| `pivot_root` | Deny `pivot_root`, should be privileged operation. |
387+
| `process_vm_readv` | Restrict process inspection capabilities, already blocked by dropping `CAP_SYS_PTRACE`. |
388+
| `process_vm_writev` | Restrict process inspection capabilities, already blocked by dropping `CAP_SYS_PTRACE`. |
389+
| `ptrace` | Tracing/profiling syscall, which could leak a lot of information on the host. Already blocked by dropping `CAP_SYS_PTRACE`. |
390+
| `query_module` | Deny manipulation and functions on kernel modules. Obsolete. |
391+
| `quotactl` | Quota syscall which could let containers disable their own resource limits or process accounting. Also gated by `CAP_SYS_ADMIN`. |
392+
| `reboot` | Don't let containers reboot the host. Also gated by `CAP_SYS_BOOT`. |
393+
| `request_key` | Prevent containers from using the kernel keyring, which is not namespaced. |
394+
| `set_mempolicy` | Syscall that modifies kernel memory and NUMA settings. Already gated by `CAP_SYS_NICE`. |
395+
| `setns` | Deny associating a thread with a namespace. Also gated by `CAP_SYS_ADMIN`. |
396+
| `settimeofday` | Time/date is not namespaced. Also gated by `CAP_SYS_TIME`. |
397+
| `socket`, `socketcall` | Used to send or receive packets and for other socket operations. All `socket` and `socketcall` calls are blocked except communication domains `AF_UNIX`, `AF_INET`, `AF_INET6`, `AF_NETLINK`, and `AF_PACKET`. |
398+
| `stime` | Time/date is not namespaced. Also gated by `CAP_SYS_TIME`. |
399+
| `swapon` | Deny start/stop swapping to file/device. Also gated by `CAP_SYS_ADMIN`. |
400+
| `swapoff` | Deny start/stop swapping to file/device. Also gated by `CAP_SYS_ADMIN`. |
401+
| `sysfs` | Obsolete syscall. |
402+
| `_sysctl` | Obsolete, replaced by /proc/sys. |
403+
| `umount` | Should be a privileged operation. Also gated by `CAP_SYS_ADMIN`. |
404+
| `umount2` | Should be a privileged operation. Also gated by `CAP_SYS_ADMIN`. |
405+
| `unshare` | Deny cloning new namespaces for processes. Also gated by `CAP_SYS_ADMIN`, with the exception of `unshare --user`. |
406+
| `uselib` | Older syscall related to shared libraries, unused for a long time. |
407+
| `userfaultfd` | Userspace page fault handling, largely needed for process migration. |
408+
| `ustat` | Obsolete syscall. |
409+
| `vm86` | In kernel x86 real mode virtual machine. Also gated by `CAP_SYS_ADMIN`. |
410+
| `vm86old` | In kernel x86 real mode virtual machine. Also gated by `CAP_SYS_ADMIN`. |
411+
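
The effect of a whitelist profile on the syscalls in this table can be sketched as a lookup: a syscall named in an allow rule gets that rule's action, and anything else falls through to the default action. The name-only, first-match lookup in `actionFor` is a simplification assumed for this sketch; the real filter is evaluated by the kernel and can also match on arguments.

```go
package main

import "fmt"

type SeccompAction string

const (
	ActErrno SeccompAction = "SCMP_ACT_ERRNO"
	ActAllow SeccompAction = "SCMP_ACT_ALLOW"
)

// Syscall and SeccompSpec are reduced copies of the spec types,
// keeping only the fields this sketch needs.
type Syscall struct {
	Names  []string
	Action SeccompAction
}

type SeccompSpec struct {
	DefaultAction SeccompAction
	Syscalls      []Syscall
}

// actionFor returns the action of the first rule that names the syscall,
// or the default action when no rule names it (hypothetical helper).
func actionFor(spec SeccompSpec, name string) SeccompAction {
	for _, rule := range spec.Syscalls {
		for _, n := range rule.Names {
			if n == name {
				return rule.Action
			}
		}
	}
	return spec.DefaultAction
}

func main() {
	// Illustrative whitelist: deny by default, allow a few syscalls.
	spec := SeccompSpec{
		DefaultAction: ActErrno,
		Syscalls: []Syscall{
			{Names: []string{"read", "write"}, Action: ActAllow},
		},
	}
	for _, name := range []string{"read", "mount", "kexec_load"} {
		fmt.Println(name, actionFor(spec, name))
	}
}
```

Under this model the syscalls in the table need no explicit deny rules: they are blocked simply because the whitelist never names them.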
412+
#### Default Behavior
413+
414+
For `privileged` containers, no default seccomp profile will be used unless
415+
explicitly requested by the user via annotations.
416+
417+
If `capAdd` is used on a container, the default profile will be adjusted to
418+
allow the syscalls associated with the added capability. The table below
419+
lists the syscalls allowed for each added capability:
420+
421+
| Capability | Syscalls Allowed |
422+
|----------------------|--------------------------------------------------------------|
423+
| `CAP_CHOWN` | chown, chown32, fchown, fchown32, fchownat, lchown, lchown32 |
424+
| `CAP_DAC_READ_SEARCH`| open_by_handle_at |
425+
| `CAP_IPC_LOCK` | mlock, mlock2, mlockall |
426+
| `CAP_SYS_ADMIN` | name_to_handle_at, bpf, clone, fanotify_init, lookup_dcookie, mount, perf_event_open, setdomainname, sethostname, setns, umount, umount2, unshare |
427+
| `CAP_SYS_BOOT` | reboot |
428+
| `CAP_SYS_CHROOT` | chroot |
429+
| `CAP_SYS_MODULE` | delete_module, init_module, finit_module, query_module |
430+
| `CAP_SYS_PACCT` | acct |
431+
| `CAP_SYS_PTRACE` | kcmp, process_vm_readv, process_vm_writev, ptrace |
432+
| `CAP_SYS_RAWIO` | iopl, ioperm |
433+
| `CAP_SYS_TIME` | settimeofday, stime, adjtimex, clock_settime |
434+
| `CAP_SYS_TTY_CONFIG` | vhangup |
435+
436+
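
The default behavior described above could be sketched roughly as follows. The map reproduces only a few rows of the capability table, and `extraSyscalls`, `resolveProfile`, and `defaultProfileName` are hypothetical names introduced for illustration; the proposal does not define how the adjustment is implemented.

```go
package main

import (
	"fmt"
	"sort"
)

// defaultProfileName is a placeholder for however the default profile is named.
const defaultProfileName = "default"

// capSyscalls holds a subset of the capability table above.
var capSyscalls = map[string][]string{
	"CAP_DAC_READ_SEARCH": {"open_by_handle_at"},
	"CAP_IPC_LOCK":        {"mlock", "mlock2", "mlockall"},
	"CAP_SYS_BOOT":        {"reboot"},
	"CAP_SYS_CHROOT":      {"chroot"},
	"CAP_SYS_PACCT":       {"acct"},
	"CAP_SYS_RAWIO":       {"iopl", "ioperm"},
	"CAP_SYS_TTY_CONFIG":  {"vhangup"},
}

// extraSyscalls returns the sorted, de-duplicated syscalls that the added
// capabilities unlock. Capabilities missing from the map contribute nothing.
func extraSyscalls(capAdd []string) []string {
	seen := make(map[string]bool)
	for _, c := range capAdd {
		for _, s := range capSyscalls[c] {
			seen[s] = true
		}
	}
	out := make([]string, 0, len(seen))
	for s := range seen {
		out = append(out, s)
	}
	sort.Strings(out)
	return out
}

// resolveProfile picks the profile for a container: an explicitly requested
// profile always wins, privileged containers otherwise get none (empty
// string), and everything else gets the default profile.
func resolveProfile(privileged bool, requested string) string {
	if requested != "" {
		return requested
	}
	if privileged {
		return ""
	}
	return defaultProfileName
}

func main() {
	fmt.Println(extraSyscalls([]string{"CAP_SYS_BOOT", "CAP_IPC_LOCK"}))
	fmt.Printf("%q\n", resolveProfile(true, ""))
	fmt.Printf("%q\n", resolveProfile(false, ""))
}
```

Keeping the capability-to-syscall mapping as data rather than code would let the table and the implementation be checked against each other.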
217437
## Examples
218438

219439
### Unconfined profile
@@ -260,7 +480,3 @@ spec:
260480
- name: test-volume
261481
emptyDir: {}
262482
```
263-
264-
<!-- BEGIN MUNGE: GENERATED_ANALYTICS -->
265-
[![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/design/seccomp.md?pixel)]()
266-
<!-- END MUNGE: GENERATED_ANALYTICS -->
