Commit f803f7b (1 parent 1b1d02d)

Rework to use gRPC between kubelet and pod

1 file changed: contributors/design-proposals/containerized-mounter-pod.md (+87 −32)
@@ -5,23 +5,21 @@ Kubernetes should be able to run all utilities that are needed to provision/atta

## Secondary objectives

These are not requirements per se, just things to consider before drawing the final design.

* The CNCF is designing the Container Storage Interface (CSI). So far, CSI expects that "volume plugins" on each host are long-running processes with a fixed gRPC API. We should aim in the same direction, hoping to switch to CSI when it's ready. In other words, there should be one long-running container per volume plugin that serves all volumes of a given type on a host.
* We should try to avoid complicated configuration. The system should work out of the box or with very limited configuration.

## Terminology

**Mount utilities** for a volume plugin are all tools that are necessary to use a volume plugin. This includes not only utilities needed to *mount* the filesystem (e.g. `mount.glusterfs` for Gluster), but also utilities needed to attach, detach, provision or delete the volume, such as `/usr/bin/rbd` for Ceph RBD.

## User story

An admin wants to run Kubernetes on a distro that does not ship `mount.glusterfs`, which is needed for GlusterFS volumes.

1. Admin installs and runs Kubernetes in any way.

1. Admin deploys a DaemonSet that runs a pod with `mount.glusterfs` on each node. In the future, this could be done by the installer.

1. User creates a pod that uses a GlusterFS volume. Kubelet finds a pod with mount utilities on the node and uses it to mount the volume instead of expecting that `mount.glusterfs` is available on the host.

- User does not need to configure anything and sees the pod Running as usual.
- Admin just needs to deploy the DaemonSet.
- It's quite hard to update the DaemonSet, see below.

## Alternatives
@@ -67,17 +65,16 @@ Disadvantages:

## Requirements on DaemonSets with mount utilities

These are rules that DaemonSet authors need to follow:
* One DaemonSet can serve mount utilities for one or more volume plugins. We expect that one volume plugin per DaemonSet will be the most popular choice.
* One DaemonSet must provide *all* utilities that are needed to provision, attach, mount, unmount, detach and delete a volume for a volume plugin, including `mkfs` and `fsck` utilities if they're needed.
  * E.g. `mkfs.ext4` is likely to be available on all hosts, but a pod with mount utilities should not depend on that nor use it.

  * The only exception is kernel modules. They are not portable across distros and they *should* be on the host.

* It is expected that these DaemonSets will run privileged pods that will see the host's `/proc`, `/dev`, `/sys`, `/var/lib/kubelet` and such. In particular, `/var/lib/kubelet` must be mounted with shared mount propagation so that kubelet can see mounts created by the pods.

* The pods with mount utilities should run a simple init as PID 1 that reaps zombie children of potential fuse daemons.

* The pods with mount utilities run a daemon with a gRPC server that implements the `VolumeExecService` defined below.

  * Upon starting, this daemon creates a UNIX domain socket in the `/var/lib/kubelet/plugin-sockets/` directory on the host. This way, kubelet is able to discover all pods with mount utilities on a node.

  * Kubernetes will ship an implementation of this daemon that creates the socket in the right place and simply executes whatever kubelet asks for.

To sum it up, it's just a DaemonSet that spawns privileged pods, each running a simple init plus a daemon that executes mount utilities as requested by kubelet via gRPC.
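The requirements above could be met by a manifest along these lines. This is a hedged sketch only: the image name, the `tini` init, and the `/volume-exec` binary path are illustrative assumptions, not part of the proposal.

```yaml
# Illustrative only: image, names and binary paths are assumptions.
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: glusterfs-mounter
spec:
  selector:
    matchLabels:
      app: glusterfs-mounter
  template:
    metadata:
      labels:
        app: glusterfs-mounter
    spec:
      containers:
      - name: mounter
        image: example.com/glusterfs-mounter:latest   # must contain mount.glusterfs + the volume-exec daemon
        command: ["/sbin/tini", "--", "/volume-exec"] # simple init as PID 1, then the gRPC daemon
        securityContext:
          privileged: true
        volumeMounts:
        - name: kubelet-dir
          mountPath: /var/lib/kubelet
          mountPropagation: Bidirectional             # kubelet must see mounts created by this pod
      volumes:
      - name: kubelet-dir
        hostPath:
          path: /var/lib/kubelet
```

The `volume-exec` daemon would create its socket under the mounted `/var/lib/kubelet/plugin-sockets/`, which is how kubelet discovers it.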
8178

8279
## Design
8380

@@ -97,30 +94,69 @@ We propose:
This ensures that kubelet runs out of the box on any distro without any configuration done by the cluster admin.

### Volume plugins
* All volume plugins need to be updated to use a new `mount.Exec` interface to call external utilities like `mount`, `mkfs`, `rbd lock` and such. The implementation of the interface will be provided by the caller and will lead either to a simple `os/exec` call on the host or to a gRPC call to a socket in the `/var/lib/kubelet/plugin-sockets/` directory.
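The caller-provided exec abstraction could be sketched in Go as follows. The interface name, method signature and `hostExec` type are illustrative assumptions; the proposal only fixes the idea of swapping host execution for gRPC execution behind one interface.

```go
package main

import (
	"fmt"
	"os/exec"
)

// Exec runs an external utility (mount, mkfs, rbd, ...) on behalf of a
// volume plugin and returns its combined stdout+stderr.
// Sketch only: the real mount.Exec signature is not fixed by this proposal.
type Exec interface {
	Run(cmd string, args ...string) ([]byte, error)
}

// hostExec executes the utility directly on the host; this is the
// fallback used when no pod with mount utilities is available.
type hostExec struct{}

func (hostExec) Run(cmd string, args ...string) ([]byte, error) {
	return exec.Command(cmd, args...).CombinedOutput()
}

func main() {
	var e Exec = hostExec{}
	out, err := e.Run("echo", "hello")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("%s", out)
}
```

A gRPC-backed implementation would satisfy the same interface, so volume plugins stay unaware of where their utilities actually run.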

### Controllers
TODO: how will controller-manager talk to a remote pod? It's relatively easy to do something like `kubectl exec <mount pod>` from controller-manager; however, it's harder to *discover* the right pod.
### Kubelet
* When kubelet talks to a volume plugin, it looks for a socket named `/var/lib/kubelet/plugin-sockets/<plugin-name>`. This also allows for easier discovery of flex volume drivers - the probe in https://github.com/kubernetes/community/pull/833 needs to scan `/var/lib/kubelet/plugin-sockets/` too and find sockets in any new subdirectories.

* If the socket does not exist, kubelet gives the volume plugin a plain `os/exec`-based implementation of the `mount.Exec` interface and all mount utilities are executed on the host.

* If the socket exists, kubelet gives the volume plugin `GRPCExec` as the implementation of `mount.Exec` and all mount utilities are executed via gRPC over the socket, which presumably leads to a pod with mount utilities running a gRPC server.
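The selection logic above amounts to a socket existence check. A minimal sketch, assuming the socket path layout described in this proposal (the function name and the example plugin name are illustrative):

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

const pluginSocketDir = "/var/lib/kubelet/plugin-sockets"

// execKind reports which mount.Exec implementation kubelet would hand to
// the volume plugin named plugin: "grpc" if the plugin's UNIX socket
// exists, otherwise "host" for plain execution on the host.
func execKind(socketDir, plugin string) string {
	p := filepath.Join(socketDir, plugin)
	if fi, err := os.Stat(p); err == nil && fi.Mode()&os.ModeSocket != 0 {
		return "grpc" // socket exists: run utilities in the mounter pod
	}
	return "host" // no socket: run utilities directly on the host
}

func main() {
	// On a node without a mounter pod the socket is missing, so kubelet
	// falls back to host execution. The plugin name is an assumption.
	fmt.Println(execKind(pluginSocketDir, "kubernetes.io~glusterfs"))
}
```

The same check, applied to whole subdirectories of `plugin-sockets/`, is what the flex probe would use for discovery.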
As a consequence, kubelet will try to run mount utilities on the host when it starts and has not yet received pods with mount utilities (and thus `/var/lib/kubelet/plugin-sockets/` is empty). This is likely to fail with a cryptic error:
```
mount: wrong fs type, bad option, bad superblock on 192.168.0.1:/test_vol,
missing codepage or helper program, or other error
```

Kubelet will periodically retry mounting the volume, and it will eventually succeed once a pod with mount utilities is scheduled and running on the node.

### gRPC API

`VolumeExecService` is a simple gRPC service that allows the caller to execute anything via gRPC:

```protobuf
service VolumeExecService {
    // Exec executes a command and returns its output.
    rpc Exec(ExecRequest) returns (ExecResponse) {}
}

message ExecRequest {
    // Command to execute.
    string cmd = 1;
    // Command arguments.
    repeated string args = 2;
}

message ExecResponse {
    enum ExecError {
        // No error, the command succeeded.
        OK = 0;
        // Helps to reconstruct exec.ErrNotFound.
        COMMAND_NOT_FOUND = 1;
        // Helps to reconstruct exec.ExitError.
        EXIT_CODE = 2;
    }
    ExecError error = 1;
    // Exit code of the command. This field is nonzero only when error == EXIT_CODE.
    int32 exit_code = 2;
    // Capture of combined stdout + stderr.
    string output = 3;
}
```

* Both `ExecRequest` and `ExecResponse` are tailored for the execution of mount utilities, which don't need any stdin and whose stdout+stderr are typically short. Therefore there is no streaming of these file descriptors.

* No authentication / authorization is done on the server side; anyone who connects to the socket can execute anything. It is expected that only root has access to `/var/lib/kubelet/plugin-sockets/`.

* The `.proto` file for this API will be stored in `k8s.io/kubernetes/pkg/version/apis/exec/v1alpha1`.

* `hack/update-generated-runtime.sh` will be updated to generate Go files for this API.

  * Should it be renamed to `update-generated-grpc-apis.sh`?

* Kubernetes will ship a daemon with a server implementation of this API in `cmd/volume-exec`. This implementation simply calls `os/exec` for each `ExecRequest` it gets and returns the right response.

  * Authors of container images with mount utilities can then add this `volume-exec` daemon to their image; they don't need to care about anything else.
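The error mapping such a daemon has to perform can be sketched as follows. The `ExecResponse` struct here stands in for the generated protobuf type, and the `runRequest` helper is illustrative, not the actual `cmd/volume-exec` code.

```go
package main

import (
	"errors"
	"fmt"
	"os/exec"
)

// ExecError mirrors the ExecError enum of the .proto above.
type ExecError int32

const (
	OK              ExecError = 0 // command succeeded
	CommandNotFound ExecError = 1 // COMMAND_NOT_FOUND
	ExitCode        ExecError = 2 // EXIT_CODE
)

// ExecResponse stands in for the generated protobuf message.
type ExecResponse struct {
	Error    ExecError
	ExitCode int32
	Output   string
}

// runRequest executes cmd with args and maps the result onto an
// ExecResponse: combined stdout+stderr, plus the error classification
// that lets the kubelet side reconstruct exec.ErrNotFound / ExitError.
func runRequest(cmd string, args ...string) ExecResponse {
	out, err := exec.Command(cmd, args...).CombinedOutput()
	resp := ExecResponse{Output: string(out)}
	var exitErr *exec.ExitError
	switch {
	case err == nil:
		// success: Error stays OK (0)
	case errors.Is(err, exec.ErrNotFound):
		resp.Error = CommandNotFound
	case errors.As(err, &exitErr):
		resp.Error = ExitCode
		resp.ExitCode = int32(exitErr.ExitCode())
	}
	return resp
}

func main() {
	fmt.Printf("%+v\n", runRequest("sh", "-c", "echo out; exit 3"))
}
```

Since there is no streaming, a failed `mount` simply comes back as `EXIT_CODE` plus its captured output, which kubelet can surface in events.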
### Upgrade
Upgrade of the DaemonSet with pods with mount utilities needs to be done node by node and with extra care. The pods may run fuse daemons, and killing a pod with a glusterfs fuse daemon would kill all pods that use glusterfs on the same node.
@@ -133,13 +169,32 @@ In order to update the DaemonSet, admin must do for every node:
Is there a way to do this with a DaemonSet rolling update? Is there a better way to do this upgrade?


## Open items

* How will controller-manager talk to pods with mount utilities?

  1. Mount pods expose a gRPC service.
     * controller-manager must be configured with the service namespace + name.
     * Some authentication must be implemented (= additional configuration of certificates and whatnot).
     * -> seems to be complicated.

  2. Mount pods run in a dedicated namespace and have labels that tell which volume plugins they can handle.
     * controller-manager scans the namespace with a label selector and does `kubectl exec <pod>` to execute anything in the pod.
     * Needs configuration of the namespace.
     * Admin must make sure that nothing else can run in the namespace (e.g. rogue pods that would steal volumes).
     * Admin must configure access to the namespace so that only pv-controller and attach-detach-controller can do `exec` there.

  3. We allow mount pods to run on hosts that run controller-manager.
     * The usual socket in `/var/lib/kubelet/plugin-sockets` will work.
     * Can it work on GKE?

## Implementation notes

* During alpha, only kubelet will be updated.

  * Depending on flex dynamic probing in https://github.com/kubernetes/community/pull/833, flex may or may not be supported during alpha.
Consequences:
* Ceph RBD dynamic provisioning will still need `/usr/bin/rbd` installed on master(s). All other volume plugins will work without any problem, as they don't execute any utility when attaching/detaching/provisioning/deleting a volume.
* Flex still needs `/usr/libexec` scripts deployed to master(s) and maybe to nodes.
