Commit b219775

borkmann authored and davem330 committed
bpf: add support for persistent maps/progs
This work adds support for "persistent" eBPF maps/programs. The term "persistent" is to be understood to mean that maps/programs have a facility that lets them survive process termination. This is desired by various eBPF subsystem users.

Just to name one example: tc classifier/action. Whenever tc parses the ELF object, extracts and loads maps/progs into the kernel, these file descriptors will be out of reach after the tc instance exits. So a subsequent tc invocation won't be able to access/relocate on this resource, and therefore maps cannot easily be shared, e.g. between the ingress and egress networking data path.

The current workaround is that Unix domain sockets (UDS) need to be instrumented in order to pass the created eBPF map/program file descriptors to a third party management daemon through UDS' socket passing facility. This makes it a bit complicated to deploy shared eBPF maps or programs (programs e.g. for tail calls) among various processes.

We've been brainstorming on how we could tackle this issue and various approaches have been tried out so far, which can be read up further in the below reference.

The architecture we eventually ended up with is a minimal file system that can hold map/prog objects. The file system is a per mount namespace singleton, and the default mount point is /sys/fs/bpf/. Any subsequent mounts within a given namespace will point to the same instance. The file system allows for creating a user-defined directory structure.

The objects for maps/progs are created/fetched through bpf(2) with two new commands (BPF_OBJ_PIN/BPF_OBJ_GET). I.e. a bpf file descriptor along with a pathname is passed to bpf(2), which in turn creates the file system nodes (we call this eBPF object pinning). Only the pathname is passed to bpf(2) for getting a new BPF file descriptor to an existing node. The user can use that to access maps and progs later on, through bpf(2).

Removal of file system nodes is managed through normal VFS functions such as unlink(2), etc. The file system code is kept to a bare minimum and can be further extended later on.

The next step I'm working on is to add dump eBPF map/prog commands to bpf(2), so that a specification from a given file descriptor can be retrieved. This can be used by things like CRIU, but applications can also inspect the meta data after calling BPF_OBJ_GET.

Big thanks also to Alexei and Hannes who significantly contributed in the design discussion that eventually let us end up with this architecture here.

Reference: https://lkml.org/lkml/2015/10/15/925

Signed-off-by: Daniel Borkmann <[email protected]>
Signed-off-by: Alexei Starovoitov <[email protected]>
Signed-off-by: Hannes Frederic Sowa <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
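The pin/get flow described in the commit message can be sketched from user space roughly as follows. This is a minimal illustration, not code from the commit: the helper names sys_bpf(), obj_pin() and obj_get() are hypothetical, and it assumes the installed <linux/bpf.h> already carries BPF_OBJ_PIN/BPF_OBJ_GET and the pathname/bpf_fd fields added here.

```c
#include <assert.h>
#include <linux/bpf.h>
#include <stdint.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Hypothetical thin wrapper: invoke bpf(2) through syscall(2). */
static long sys_bpf(int cmd, union bpf_attr *attr)
{
	return syscall(__NR_bpf, cmd, attr, sizeof(*attr));
}

/*
 * Pin the map/prog referenced by fd at pathname. The path is expected
 * to live on a mounted bpf file system. Returns 0 or -1 with errno set.
 */
static int obj_pin(int fd, const char *pathname)
{
	union bpf_attr attr;

	memset(&attr, 0, sizeof(attr));
	attr.pathname = (uint64_t)(uintptr_t)pathname;
	attr.bpf_fd = fd;
	return (int)sys_bpf(BPF_OBJ_PIN, &attr);
}

/*
 * Fetch a new file descriptor for an already pinned object. Only the
 * pathname is passed. Returns the fd or -1 with errno set.
 */
static int obj_get(const char *pathname)
{
	union bpf_attr attr;

	memset(&attr, 0, sizeof(attr));
	attr.pathname = (uint64_t)(uintptr_t)pathname;
	return (int)sys_bpf(BPF_OBJ_GET, &attr);
}
```

A loader would call something like obj_pin(map_fd, "/sys/fs/bpf/my_map") (the file name is made up for illustration), and a later process would call obj_get() on the same path to regain access. The node itself goes away through unlink(2), as the commit message notes.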
1 parent e9d8afa commit b219775

6 files changed, +433 -41 lines changed

include/linux/bpf.h

+7
@@ -167,11 +167,18 @@ struct bpf_prog *bpf_prog_get(u32 ufd);
 void bpf_prog_put(struct bpf_prog *prog);
 void bpf_prog_put_rcu(struct bpf_prog *prog);
 
+struct bpf_map *bpf_map_get(u32 ufd);
 struct bpf_map *__bpf_map_get(struct fd f);
 void bpf_map_put(struct bpf_map *map);
 
 extern int sysctl_unprivileged_bpf_disabled;
 
+int bpf_map_new_fd(struct bpf_map *map);
+int bpf_prog_new_fd(struct bpf_prog *prog);
+
+int bpf_obj_pin_user(u32 ufd, const char __user *pathname);
+int bpf_obj_get_user(const char __user *pathname);
+
 /* verify correctness of eBPF program */
 int bpf_check(struct bpf_prog **fp, union bpf_attr *attr);
 #else

include/uapi/linux/bpf.h

+8 -37
@@ -63,50 +63,16 @@ struct bpf_insn {
 	__s32	imm;		/* signed immediate constant */
 };
 
-/* BPF syscall commands */
+/* BPF syscall commands, see bpf(2) man-page for details. */
 enum bpf_cmd {
-	/* create a map with given type and attributes
-	 * fd = bpf(BPF_MAP_CREATE, union bpf_attr *, u32 size)
-	 * returns fd or negative error
-	 * map is deleted when fd is closed
-	 */
 	BPF_MAP_CREATE,
-
-	/* lookup key in a given map
-	 * err = bpf(BPF_MAP_LOOKUP_ELEM, union bpf_attr *attr, u32 size)
-	 * Using attr->map_fd, attr->key, attr->value
-	 * returns zero and stores found elem into value
-	 * or negative error
-	 */
 	BPF_MAP_LOOKUP_ELEM,
-
-	/* create or update key/value pair in a given map
-	 * err = bpf(BPF_MAP_UPDATE_ELEM, union bpf_attr *attr, u32 size)
-	 * Using attr->map_fd, attr->key, attr->value, attr->flags
-	 * returns zero or negative error
-	 */
 	BPF_MAP_UPDATE_ELEM,
-
-	/* find and delete elem by key in a given map
-	 * err = bpf(BPF_MAP_DELETE_ELEM, union bpf_attr *attr, u32 size)
-	 * Using attr->map_fd, attr->key
-	 * returns zero or negative error
-	 */
 	BPF_MAP_DELETE_ELEM,
-
-	/* lookup key in a given map and return next key
-	 * err = bpf(BPF_MAP_GET_NEXT_KEY, union bpf_attr *attr, u32 size)
-	 * Using attr->map_fd, attr->key, attr->next_key
-	 * returns zero and stores next key or negative error
-	 */
 	BPF_MAP_GET_NEXT_KEY,
-
-	/* verify and load eBPF program
-	 * prog_fd = bpf(BPF_PROG_LOAD, union bpf_attr *attr, u32 size)
-	 * Using attr->prog_type, attr->insns, attr->license
-	 * returns fd or negative error
-	 */
 	BPF_PROG_LOAD,
+	BPF_OBJ_PIN,
+	BPF_OBJ_GET,
 };
 
 enum bpf_map_type {
@@ -160,6 +126,11 @@ union bpf_attr {
 		__aligned_u64	log_buf;	/* user supplied buffer */
 		__u32		kern_version;	/* checked when prog_type=kprobe */
 	};
+
+	struct { /* anonymous struct used by BPF_OBJ_* commands */
+		__aligned_u64	pathname;
+		__u32		bpf_fd;
+	};
 } __attribute__((aligned(8)));
 
 /* integer value in 'imm' field of BPF_CALL instruction selects which helper

include/uapi/linux/magic.h

+1
@@ -75,5 +75,6 @@
 #define ANON_INODE_FS_MAGIC	0x09041934
 #define BTRFS_TEST_MAGIC	0x73727279
 #define NSFS_MAGIC		0x6e736673
+#define BPF_FS_MAGIC		0xcafe4a11
 
 #endif /* __LINUX_MAGIC_H__ */
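One way user space might make use of the new magic number is to check whether a path sits on a bpf file system before attempting to pin there. The sketch below is an illustration under assumptions, not part of the commit: is_bpf_fs() is a hypothetical helper, and it assumes the file system reports BPF_FS_MAGIC as f_type through statfs(2).

```c
#include <assert.h>
#include <sys/vfs.h>

/* Fallback for older headers; the value is the one added by this commit. */
#ifndef BPF_FS_MAGIC
#define BPF_FS_MAGIC 0xcafe4a11
#endif

/*
 * Hypothetical helper: returns 1 if path is on a bpf file system,
 * 0 if it is on some other file system, and -1 if statfs(2) fails.
 */
static int is_bpf_fs(const char *path)
{
	struct statfs st;

	if (statfs(path, &st) < 0)
		return -1;
	/* Compare as unsigned long so a signed f_type does not mismatch. */
	return (unsigned long)st.f_type == (unsigned long)BPF_FS_MAGIC ? 1 : 0;
}
```

A loader could call is_bpf_fs("/sys/fs/bpf") first and, if it returns 0, either mount the file system or report an error to the user.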

kernel/bpf/Makefile

+3 -1
@@ -1,2 +1,4 @@
 obj-y := core.o
-obj-$(CONFIG_BPF_SYSCALL) += syscall.o verifier.o hashtab.o arraymap.o helpers.o
+
+obj-$(CONFIG_BPF_SYSCALL) += syscall.o verifier.o inode.o helpers.o
+obj-$(CONFIG_BPF_SYSCALL) += hashtab.o arraymap.o
