This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.

Commit 170a771

committed May 24, 2017
seccomp: add default profile
Signed-off-by: Jess Frazelle <[email protected]>
1 parent 4fff7fa commit 170a771

1 file changed

+115
-8
lines changed

‎contributors/design-proposals/seccomp.md

@@ -28,6 +28,7 @@ This design should:
2828
* be container-runtime agnostic
2929
* allow use of custom profiles
3030
* facilitate containerized applications that link directly to libseccomp
31+
* enable a default seccomp profile for containers
3132

3233
## Use Cases
3334

@@ -40,14 +41,16 @@ This design should:
4041
unmediated by Kubernetes
4142
4. As a user, I want to be able to use a custom seccomp profile and use
4243
it with my containers
44+
5. As a user and administrator, I want Kubernetes to apply a sane default
45+
seccomp profile to containers unless I specify otherwise.
4346

44-
### Use Case: Administrator access control
47+
### Use Case: Administrator Access Control
4548

4649
Controlling access to seccomp profiles is a cluster administrator
4750
concern. It should be possible for an administrator to control which users
4851
have access to which profiles.
4952

50-
The [pod security policy](https://github.com/kubernetes/kubernetes/pull/7893)
53+
The [Pod Security Policy](https://github.com/kubernetes/kubernetes/pull/7893)
5154
API extension governs the ability of users to make requests that affect pod
5255
and container security contexts. The proposed design should deal with
5356
required changes to control access to new functionality.
@@ -101,9 +104,9 @@ implement a sandbox for user-provided code, such as
101104

102105
## Community Work
103106

104-
### Container runtime support for seccomp
107+
### Container Runtime Support for Seccomp
105108

106-
#### Docker / opencontainers
109+
#### Docker / OCI
107110

108111
Docker supports the open container initiative's API for
109112
seccomp, which is very close to the libseccomp API. It allows full
@@ -112,13 +115,20 @@ specification of seccomp filters, with arguments, operators, and actions.
112115
Docker allows the specification of a single seccomp filter. There are
113116
community requests for:
114117

115-
Issues:
116-
117118
* [docker/22109](https://github.com/docker/docker/issues/22109): composable
118119
seccomp filters
119120
* [docker/21105](https://github.com/docker/docker/issues/22105): custom
120121
seccomp filters for builds
121122

123+
Implementation details:
124+
125+
* [docker/17989](https://github.com/moby/moby/pull/17989): initial
126+
implementation
127+
* [docker/18780](https://github.com/moby/moby/pull/18780): default blacklist
128+
profile
129+
* [docker/18979](https://github.com/moby/moby/pull/18979): default whitelist
130+
profile
131+
122132
#### rkt / appcontainers
123133

124134
The `rkt` runtime delegates to systemd for seccomp support; there is an open
@@ -168,8 +178,6 @@ Instead of implementing a new API resource, we propose that pods be able to
168178
reference seccomp profiles by name. Since this is an alpha feature, we will
169179
use annotations instead of extending the API with new fields.
170180

171-
### API changes?
172-
173181
In the alpha version of this feature we will use annotations to store the
174182
names of seccomp profiles. The keys will be:
175183

@@ -214,6 +222,105 @@ profiles using the key
214222
`seccomp.security.alpha.kubernetes.io/allowedProfileNames`. The value of this
215223
key should be a comma delimited list.
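
As a rough illustration of the comma-delimited value described above, the following Python sketch parses the annotation and checks a requested profile against it. Only the annotation key comes from this proposal; the function names and the profile names in the sample value are hypothetical, and the real admission logic may differ.

```python
# Annotation key taken from the proposal text.
ALLOWED_KEY = "seccomp.security.alpha.kubernetes.io/allowedProfileNames"


def allowed_profiles(annotations):
    """Split the comma-delimited annotation value into profile names."""
    raw = annotations.get(ALLOWED_KEY, "")
    # Tolerate whitespace around commas and ignore empty entries.
    return [name.strip() for name in raw.split(",") if name.strip()]


def is_profile_allowed(annotations, profile):
    """Return True if `profile` appears in the allowed list."""
    return profile in allowed_profiles(annotations)


# Hypothetical policy annotations; the profile names are placeholders.
policy_annotations = {ALLOWED_KEY: "docker/default, localhost/my-profile"}
```

With these assumptions, a policy that carries no such annotation yields an empty list, so no profile name would match it.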
216224

225+
### Default Profile
226+
227+
We will create our own default seccomp profile for containers and use the
228+
Docker default as the initial base. Having our own will allow us to control and
229+
restrict different syscalls in the future.
230+
231+
The default profile will match the spec of Docker's profile but in the future
232+
it can evolve to that of the OCI spec once CRI is complete and we move into
233+
a beta with a seccomp spec of our own.
234+
235+
#### Various Syscalls Not Allowed
236+
237+
The table below lists some of the syscalls that our whitelist will not allow,
238+
and why. It is not exhaustive and covers only the more important ones. Most of
239+
it was taken from the
240+
[original pull request](https://github.com/moby/moby/pull/19059) to Docker
241+
for the default profile.
242+
243+
| Syscall | Description |
244+
|---------------------|---------------------------------------------------------------------------------------------------------------------------------------|
245+
| `acct` | Accounting syscall which could let containers disable their own resource limits or process accounting. Also gated by `CAP_SYS_PACCT`. |
246+
| `add_key` | Prevent containers from using the kernel keyring, which is not namespaced. |
247+
| `adjtimex` | Similar to `clock_settime` and `settimeofday`, time/date is not namespaced. Also gated by `CAP_SYS_TIME`. |
248+
| `bpf` | Deny loading potentially persistent bpf programs into kernel, already gated by `CAP_SYS_ADMIN`. |
249+
| `clock_adjtime` | Time/date is not namespaced. Also gated by `CAP_SYS_TIME`. |
250+
| `clock_settime` | Time/date is not namespaced. Also gated by `CAP_SYS_TIME`. |
251+
| `clone` | Deny cloning new namespaces. Also gated by `CAP_SYS_ADMIN` for CLONE_* flags, except `CLONE_NEWUSER`. |
252+
| `create_module` | Deny manipulation and functions on kernel modules. Obsolete. Also gated by `CAP_SYS_MODULE`. |
253+
| `delete_module` | Deny manipulation and functions on kernel modules. Also gated by `CAP_SYS_MODULE`. |
254+
| `finit_module` | Deny manipulation and functions on kernel modules. Also gated by `CAP_SYS_MODULE`. |
255+
| `get_kernel_syms` | Deny retrieval of exported kernel and module symbols. Obsolete. |
256+
| `get_mempolicy` | Syscall that modifies kernel memory and NUMA settings. Already gated by `CAP_SYS_NICE`. |
257+
| `init_module` | Deny manipulation and functions on kernel modules. Also gated by `CAP_SYS_MODULE`. |
258+
| `ioperm` | Prevent containers from modifying kernel I/O privilege levels. Already gated by `CAP_SYS_RAWIO`. |
259+
| `iopl` | Prevent containers from modifying kernel I/O privilege levels. Already gated by `CAP_SYS_RAWIO`. |
260+
| `kcmp` | Restrict process inspection capabilities, already blocked by dropping `CAP_SYS_PTRACE`. |
261+
| `kexec_file_load` | Sister syscall of `kexec_load` that does the same thing, slightly different arguments. Also gated by `CAP_SYS_BOOT`. |
262+
| `kexec_load` | Deny loading a new kernel for later execution. Also gated by `CAP_SYS_BOOT`. |
263+
| `keyctl` | Prevent containers from using the kernel keyring, which is not namespaced. |
264+
| `lookup_dcookie` | Tracing/profiling syscall, which could leak a lot of information on the host. Also gated by `CAP_SYS_ADMIN`. |
265+
| `mbind` | Syscall that modifies kernel memory and NUMA settings. Already gated by `CAP_SYS_NICE`. |
266+
| `mount` | Deny mounting, already gated by `CAP_SYS_ADMIN`. |
267+
| `move_pages` | Syscall that modifies kernel memory and NUMA settings. |
268+
| `name_to_handle_at` | Sister syscall to `open_by_handle_at`. Already gated by `CAP_SYS_NICE`. |
269+
| `nfsservctl` | Deny interaction with the kernel nfs daemon. Obsolete since Linux 3.1. |
270+
| `open_by_handle_at` | Cause of an old container breakout. Also gated by `CAP_DAC_READ_SEARCH`. |
271+
| `perf_event_open` | Tracing/profiling syscall, which could leak a lot of information on the host. |
272+
| `personality` | Prevent container from enabling BSD emulation. Not inherently dangerous, but poorly tested, potential for a lot of kernel vulns. |
273+
| `pivot_root` | Deny `pivot_root`, should be privileged operation. |
274+
| `process_vm_readv` | Restrict process inspection capabilities, already blocked by dropping `CAP_SYS_PTRACE`. |
275+
| `process_vm_writev` | Restrict process inspection capabilities, already blocked by dropping `CAP_SYS_PTRACE`. |
276+
| `ptrace` | Tracing/profiling syscall, which could leak a lot of information on the host. Already blocked by dropping `CAP_SYS_PTRACE`. |
277+
| `query_module` | Deny manipulation and functions on kernel modules. Obsolete. |
278+
| `quotactl` | Quota syscall which could let containers disable their own resource limits or process accounting. Also gated by `CAP_SYS_ADMIN`. |
279+
| `reboot` | Don't let containers reboot the host. Also gated by `CAP_SYS_BOOT`. |
280+
| `request_key` | Prevent containers from using the kernel keyring, which is not namespaced. |
281+
| `set_mempolicy` | Syscall that modifies kernel memory and NUMA settings. Already gated by `CAP_SYS_NICE`. |
282+
| `setns` | Deny associating a thread with a namespace. Also gated by `CAP_SYS_ADMIN`. |
283+
| `settimeofday` | Time/date is not namespaced. Also gated by `CAP_SYS_TIME`. |
284+
| `socket`, `socketcall` | Used to send or receive packets and for other socket operations. All `socket` and `socketcall` calls are blocked except communication domains `AF_UNIX`, `AF_INET`, `AF_INET6`, `AF_NETLINK`, and `AF_PACKET`. |
285+
| `stime` | Time/date is not namespaced. Also gated by `CAP_SYS_TIME`. |
286+
| `swapon` | Deny start/stop swapping to file/device. Also gated by `CAP_SYS_ADMIN`. |
287+
| `swapoff` | Deny start/stop swapping to file/device. Also gated by `CAP_SYS_ADMIN`. |
288+
| `sysfs` | Obsolete syscall. |
289+
| `_sysctl` | Obsolete, replaced by /proc/sys. |
290+
| `umount` | Should be a privileged operation. Also gated by `CAP_SYS_ADMIN`. |
291+
| `umount2` | Should be a privileged operation. Also gated by `CAP_SYS_ADMIN`. |
292+
| `unshare` | Deny cloning new namespaces for processes. Also gated by `CAP_SYS_ADMIN`, with the exception of `unshare --user`. |
293+
| `uselib` | Older syscall related to shared libraries, unused for a long time. |
294+
| `userfaultfd` | Userspace page fault handling, largely needed for process migration. |
295+
| `ustat` | Obsolete syscall. |
296+
| `vm86` | In kernel x86 real mode virtual machine. Also gated by `CAP_SYS_ADMIN`. |
297+
| `vm86old` | In kernel x86 real mode virtual machine. Also gated by `CAP_SYS_ADMIN`. |
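
To make the whitelist idea concrete, here is a minimal Python sketch that builds a profile in the JSON shape used by Docker's seccomp profiles (`defaultAction`, `syscalls`, `names`, `action`, `args`) and evaluates a call against it. The three-syscall allow list is a placeholder, not the real default; the evaluator is a toy stand-in for the kernel filter that only understands `SCMP_CMP_EQ`; and only `socket` (not `socketcall`) is modelled. The address-family numbers are the Linux values.

```python
import json

# Linux address-family numbers (from <sys/socket.h>).
AF_UNIX, AF_INET, AF_INET6, AF_NETLINK, AF_PACKET = 1, 2, 10, 16, 17
ALLOWED_DOMAINS = (AF_UNIX, AF_INET, AF_INET6, AF_NETLINK, AF_PACKET)


def build_profile(allowed_syscalls):
    """Build a whitelist profile: deny by default, allow listed syscalls."""
    rules = [{"names": sorted(allowed_syscalls),
              "action": "SCMP_ACT_ALLOW",
              "args": []}]
    # One rule per permitted socket domain (argument 0 of socket()).
    for domain in ALLOWED_DOMAINS:
        rules.append({
            "names": ["socket"],
            "action": "SCMP_ACT_ALLOW",
            "args": [{"index": 0, "value": domain, "op": "SCMP_CMP_EQ"}],
        })
    return {"defaultAction": "SCMP_ACT_ERRNO", "syscalls": rules}


def evaluate(profile, syscall, args=()):
    """Toy evaluator: return the action the profile assigns to a call."""
    for rule in profile["syscalls"]:
        if syscall not in rule["names"]:
            continue
        # Every argument condition must match (equality only in this sketch).
        if all(len(args) > c["index"] and args[c["index"]] == c["value"]
               for c in rule["args"]):
            return rule["action"]
    return profile["defaultAction"]


# Placeholder allow list; the real default profile allows far more syscalls.
profile = build_profile({"read", "write", "exit_group"})
print(json.dumps(profile, indent=2))
```

Under these assumptions, any syscall from the table above that is not named in the allow list, such as `mount`, falls through to the default `SCMP_ACT_ERRNO` action, and `socket` is allowed only for the five listed domains.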
298+
299+
#### Default Behavior
300+
301+
For `privileged` containers, no default seccomp profile will be used unless
302+
explicitly requested by the user via annotations.
303+
304+
If `capAdd` is used on a container, the default profile will be adjusted to
305+
allow the syscalls associated with the added capability. The table below lists
306+
these syscalls by capability:
307+
308+
| Capability | Syscalls Allowed |
309+
|----------------------|--------------------------------------------------------------|
310+
| `CAP_CHOWN` | chown, chown32, fchown, fchown32, fchownat, lchown, lchown32 |
311+
| `CAP_DAC_READ_SEARCH`| open_by_handle_at |
312+
| `CAP_IPC_LOCK` | mlock, mlock2, mlockall |
313+
| `CAP_SYS_ADMIN` | name_to_handle_at, bpf, clone, fanotify_init, lookup_dcookie, mount, perf_event_open, setdomainname, sethostname, setns, umount, umount2, unshare |
314+
| `CAP_SYS_BOOT` | reboot |
315+
| `CAP_SYS_CHROOT` | chroot |
316+
| `CAP_SYS_MODULE` | delete_module, init_module, finit_module, query_module |
317+
| `CAP_SYS_PACCT` | acct |
318+
| `CAP_SYS_PTRACE` | kcmp, process_vm_readv, process_vm_writev, ptrace |
319+
| `CAP_SYS_RAWIO` | iopl, ioperm |
320+
| `CAP_SYS_TIME` | settimeofday, stime, adjtimex, clock_settime |
321+
| `CAP_SYS_TTY_CONFIG` | vhangup |
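
The adjustment described in this section could be sketched as follows in Python. The mapping is a subset copied from the table above; the function name, the `privileged` flag, and the use of `None` to mean "no default profile" are assumptions made for illustration only.

```python
# Subset of the capability-to-syscall table above.
CAP_SYSCALLS = {
    "CAP_SYS_BOOT": ["reboot"],
    "CAP_SYS_CHROOT": ["chroot"],
    "CAP_SYS_PACCT": ["acct"],
    "CAP_SYS_RAWIO": ["iopl", "ioperm"],
    "CAP_SYS_TTY_CONFIG": ["vhangup"],
}


def effective_allow_list(base_allowed, added_caps, privileged=False):
    """Return the set of allowed syscalls, or None if no profile applies."""
    if privileged:
        # Privileged containers get no default profile.
        return None
    allowed = set(base_allowed)
    for cap in added_caps:
        # Capabilities without an entry leave the allow list unchanged.
        allowed.update(CAP_SYSCALLS.get(cap, []))
    return allowed


# Placeholder base list; the real default profile is much larger.
base = {"read", "write"}
adjusted = effective_allow_list(base, ["CAP_SYS_RAWIO", "CAP_SYS_BOOT"])
```

Here `adjusted` contains the base syscalls plus `iopl`, `ioperm`, and `reboot`, while a privileged container would get `None`, meaning no default profile is applied.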
322+
323+
217324
## Examples
218325

219326
### Unconfined profile
