Commit 85d33df

Martin KaFai Lau (iamkafai) authored and Alexei Starovoitov committed
bpf: Introduce BPF_MAP_TYPE_STRUCT_OPS
The patch introduces BPF_MAP_TYPE_STRUCT_OPS. The map value is a kernel
struct with its func ptrs implemented in bpf progs. This new map is the
interface to register/unregister/introspect a bpf implemented kernel
struct.

The kernel struct is actually embedded inside another new struct (called
the "value" struct in the code). For example, "struct tcp_congestion_ops"
is embedded in:

struct bpf_struct_ops_tcp_congestion_ops {
	refcount_t refcnt;
	enum bpf_struct_ops_state state;
	struct tcp_congestion_ops data;  /* <-- kernel subsystem struct here */
}

The map value is "struct bpf_struct_ops_tcp_congestion_ops". "bpftool map
dump" will then be able to show the state ("inuse"/"tobefree") and the
subsystem's refcnt (e.g. the number of tcp_socks in the tcp_congestion_ops
case). This "value" struct is created automatically by a macro. Having a
separate "value" struct also makes it easier to extend
"struct bpf_struct_ops_XYZ" later (e.g. adding "void (*init)(void)" to
"struct bpf_struct_ops_XYZ" to do some initialization work before
registering the struct_ops to the kernel subsystem). libbpf takes care of
finding and populating "struct bpf_struct_ops_XYZ" from "struct XYZ".

Register a struct_ops to a kernel subsystem (a userspace sketch of these
steps appears below):
1. Load all needed BPF_PROG_TYPE_STRUCT_OPS prog(s).
2. Create a BPF_MAP_TYPE_STRUCT_OPS map with attr->btf_vmlinux_value_type_id
   set to the btf id of "struct bpf_struct_ops_tcp_congestion_ops" in the
   running kernel. Instead of reusing attr->btf_value_type_id,
   btf_vmlinux_value_type_id is added so that attr->btf_fd can still be
   used as the "user" btf, which could store other useful sysadmin/debug
   info that may be introduced in the future, e.g.
   creation-date/compiler-details/map-creator...etc.
3. Create a "struct bpf_struct_ops_tcp_congestion_ops" object as described
   in the running kernel btf and populate its value. The func ptr members
   should be populated with the prog fds.
4. Call BPF_MAP_UPDATE with the object created in (3) as the map value.
   The key is always "0".

During BPF_MAP_UPDATE, the code that saves the kernel-func-ptr's args as an
array of u64 is generated. BPF_MAP_UPDATE also allows the specific
struct_ops to do some final checks in "st_ops->init_member()" (e.g. ensure
all mandatory func ptrs are implemented). If everything looks good, it
registers this kernel struct to the kernel subsystem. The map will not
allow further updates from this point on.

Unregister a struct_ops from the kernel subsystem:
BPF_MAP_DELETE with key "0".

Introspect a struct_ops:
BPF_MAP_LOOKUP_ELEM with key "0". The map value returned will have the
prog _id_ populated as the func ptr.

The map value state (enum bpf_struct_ops_state) will transit from:
INIT (map created) =>
INUSE (map updated, i.e. reg) =>
TOBEFREE (map value deleted, i.e. unreg)

The kernel subsystem needs to call bpf_struct_ops_get() and
bpf_struct_ops_put() to manage the "refcnt" in
"struct bpf_struct_ops_XYZ". This patch uses a separate refcnt for the
purpose of tracking the subsystem usage. Another approach would be to
reuse map->refcnt and then "show" (i.e. during map_lookup) the subsystem's
usage by computing map->refcnt - map->usercnt to filter out the
map-fd/pinned-map usage. However, that would also tie down the future
semantics of map->refcnt and map->usercnt.

The very first subsystem refcnt (taken during reg()) holds one count on
map->refcnt. When the very last subsystem refcnt is gone, it also releases
map->refcnt. All bpf_progs are freed when map->refcnt reaches 0 (i.e.
during map_free()).
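[Editor's sketch] As a concrete illustration of registration steps 2-4
above, here is a minimal userspace sketch against the raw bpf(2) syscall.
It is not part of this patch: the sys_bpf() and register_struct_ops()
helpers are hypothetical names, and it assumes the BPF_PROG_TYPE_STRUCT_OPS
progs are already loaded (step 1) and that value_type_id/value_size were
resolved from the running kernel's vmlinux BTF for
"struct bpf_struct_ops_tcp_congestion_ops".

#include <linux/bpf.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>

static int sys_bpf(enum bpf_cmd cmd, union bpf_attr *attr, unsigned int size)
{
	return syscall(__NR_bpf, cmd, attr, size);
}

static int register_struct_ops(__u32 value_type_id, __u32 value_size,
			       void *value)
{
	union bpf_attr attr;
	__u32 zero = 0;
	int map_fd, err;

	/* Step 2: create the struct_ops map.  The value type lives in the
	 * running kernel's vmlinux BTF, so btf_vmlinux_value_type_id is
	 * used rather than btf_value_type_id.
	 */
	memset(&attr, 0, sizeof(attr));
	attr.map_type = BPF_MAP_TYPE_STRUCT_OPS;
	attr.key_size = sizeof(__u32);
	attr.value_size = value_size;
	attr.max_entries = 1;
	attr.btf_vmlinux_value_type_id = value_type_id;
	map_fd = sys_bpf(BPF_MAP_CREATE, &attr, sizeof(attr));
	if (map_fd < 0)
		return map_fd;

	/* Steps 3-4: "value" is a struct bpf_struct_ops_tcp_congestion_ops
	 * laid out as described by the kernel BTF, with its func ptr
	 * members holding prog fds.  The key is always "0"; this update is
	 * what registers the ops to the subsystem.
	 */
	memset(&attr, 0, sizeof(attr));
	attr.map_fd = map_fd;
	attr.key = (__u64)(unsigned long)&zero;
	attr.value = (__u64)(unsigned long)value;
	err = sys_bpf(BPF_MAP_UPDATE_ELEM, &attr, sizeof(attr));
	return err ? err : map_fd;
}

In practice libbpf is expected to drive this same sequence when it finds
and populates "struct bpf_struct_ops_XYZ" from "struct XYZ", as noted in
the commit message.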
Here is how the bpftool map command will look like:

[root@arch-fb-vm1 bpf]# bpftool map show
6: struct_ops  name dctcp  flags 0x0
	key 4B  value 256B  max_entries 1  memlock 4096B
	btf_id 6
[root@arch-fb-vm1 bpf]# bpftool map dump id 6
[{
        "value": {
            "refcnt": {
                "refs": {
                    "counter": 1
                }
            },
            "state": 1,
            "data": {
                "list": {
                    "next": 0,
                    "prev": 0
                },
                "key": 0,
                "flags": 2,
                "init": 24,
                "release": 0,
                "ssthresh": 25,
                "cong_avoid": 30,
                "set_state": 27,
                "cwnd_event": 28,
                "in_ack_event": 26,
                "undo_cwnd": 29,
                "pkts_acked": 0,
                "min_tso_segs": 0,
                "sndbuf_expand": 0,
                "cong_control": 0,
                "get_info": 0,
                "name": [98,112,102,95,100,99,116,99,112,0,0,0,0,0,0,0],
                "owner": 0
            }
        }
    }
]

Misc Notes:
* bpf_struct_ops_map_sys_lookup_elem() is added for syscall lookup. It does
  an in-place update on "*value" instead of returning a pointer to
  syscall.c. Otherwise, it would need a separate copy of the "zero" value
  for BPF_STRUCT_OPS_STATE_INIT to avoid races.
* bpf_struct_ops_map_delete_elem() is also called without preempt_disable()
  from map_delete_elem(), because "->unreg()" may require a sleepable
  context, e.g. "tcp_unregister_congestion_control()".
* "const" is added to some of the existing "struct btf_func_model *"
  function args to avoid a compiler warning caused by this patch.

Signed-off-by: Martin KaFai Lau <[email protected]>
Signed-off-by: Alexei Starovoitov <[email protected]>
Acked-by: Andrii Nakryiko <[email protected]>
Acked-by: Yonghong Song <[email protected]>
Link: https://lore.kernel.org/bpf/[email protected]
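[Editor's sketch] The introspection and unregistration paths described in
the commit message (BPF_MAP_LOOKUP_ELEM / BPF_MAP_DELETE_ELEM with key "0")
can be sketched the same way. This reuses the hypothetical sys_bpf() helper
and includes from the sketch above; introspect_struct_ops() and
unregister_struct_ops() are illustrative names. The returned bytes follow
the vmlinux BTF layout of "struct bpf_struct_ops_tcp_congestion_ops", with
prog ids in the func ptr slots, which is what the bpftool dump above is
rendering.

static int introspect_struct_ops(int map_fd, void *value_out)
{
	union bpf_attr attr;
	__u32 zero = 0;

	/* BPF_MAP_LOOKUP_ELEM with key "0": the kernel copies the current
	 * "struct bpf_struct_ops_XYZ" value into value_out; func ptr
	 * members come back as prog _ids_, and refcnt/state are visible
	 * as well.
	 */
	memset(&attr, 0, sizeof(attr));
	attr.map_fd = map_fd;
	attr.key = (__u64)(unsigned long)&zero;
	attr.value = (__u64)(unsigned long)value_out;
	return sys_bpf(BPF_MAP_LOOKUP_ELEM, &attr, sizeof(attr));
}

static int unregister_struct_ops(int map_fd)
{
	union bpf_attr attr;
	__u32 zero = 0;

	/* BPF_MAP_DELETE with key "0" triggers st_ops->unreg() and moves
	 * the map value state from INUSE to TOBEFREE.
	 */
	memset(&attr, 0, sizeof(attr));
	attr.map_fd = map_fd;
	attr.key = (__u64)(unsigned long)&zero;
	return sys_bpf(BPF_MAP_DELETE_ELEM, &attr, sizeof(attr));
}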
1 parent 27ae799 commit 85d33df

11 files changed: +642 -47 lines changed

Diff for: arch/x86/net/bpf_jit_comp.c

+8-10
@@ -1328,7 +1328,7 @@ xadd: if (is_imm8(insn->off))
 	return proglen;
 }
 
-static void save_regs(struct btf_func_model *m, u8 **prog, int nr_args,
+static void save_regs(const struct btf_func_model *m, u8 **prog, int nr_args,
 		      int stack_size)
 {
 	int i;
@@ -1344,7 +1344,7 @@ static void save_regs(struct btf_func_model *m, u8 **prog, int nr_args,
 			 -(stack_size - i * 8));
 }
 
-static void restore_regs(struct btf_func_model *m, u8 **prog, int nr_args,
+static void restore_regs(const struct btf_func_model *m, u8 **prog, int nr_args,
 			 int stack_size)
 {
 	int i;
@@ -1361,7 +1361,7 @@ static void restore_regs(struct btf_func_model *m, u8 **prog, int nr_args,
 			 -(stack_size - i * 8));
 }
 
-static int invoke_bpf(struct btf_func_model *m, u8 **pprog,
+static int invoke_bpf(const struct btf_func_model *m, u8 **pprog,
 		      struct bpf_prog **progs, int prog_cnt, int stack_size)
 {
 	u8 *prog = *pprog;
@@ -1456,7 +1456,8 @@ static int invoke_bpf(struct btf_func_model *m, u8 **pprog,
  * add rsp, 8                     // skip eth_type_trans's frame
  * ret                            // return to its caller
  */
-int arch_prepare_bpf_trampoline(void *image, struct btf_func_model *m, u32 flags,
+int arch_prepare_bpf_trampoline(void *image, void *image_end,
+				const struct btf_func_model *m, u32 flags,
 				struct bpf_prog **fentry_progs, int fentry_cnt,
 				struct bpf_prog **fexit_progs, int fexit_cnt,
 				void *orig_call)
@@ -1523,13 +1524,10 @@ int arch_prepare_bpf_trampoline(void *image, struct btf_func_model *m, u32 flags
 	/* skip our return address and return to parent */
 	EMIT4(0x48, 0x83, 0xC4, 8); /* add rsp, 8 */
 	EMIT1(0xC3); /* ret */
-	/* One half of the page has active running trampoline.
-	 * Another half is an area for next trampoline.
-	 * Make sure the trampoline generation logic doesn't overflow.
-	 */
-	if (WARN_ON_ONCE(prog - (u8 *)image > PAGE_SIZE / 2 - BPF_INSN_SAFETY))
+	/* Make sure the trampoline generation logic doesn't overflow */
+	if (WARN_ON_ONCE(prog > (u8 *)image_end - BPF_INSN_SAFETY))
 		return -EFAULT;
-	return 0;
+	return prog - (u8 *)image;
 }
 
 static int emit_cond_near_jump(u8 **pprog, void *func, void *ip, u8 jmp_cond)

Diff for: include/linux/bpf.h

+47-2
@@ -17,6 +17,7 @@
 #include <linux/u64_stats_sync.h>
 #include <linux/refcount.h>
 #include <linux/mutex.h>
+#include <linux/module.h>
 
 struct bpf_verifier_env;
 struct bpf_verifier_log;
@@ -106,6 +107,7 @@ struct bpf_map {
 	struct btf *btf;
 	struct bpf_map_memory memory;
 	char name[BPF_OBJ_NAME_LEN];
+	u32 btf_vmlinux_value_type_id;
 	bool unpriv_array;
 	bool frozen; /* write-once; write-protected by freeze_mutex */
 	/* 22 bytes hole */
@@ -183,7 +185,8 @@ static inline bool bpf_map_offload_neutral(const struct bpf_map *map)
 
 static inline bool bpf_map_support_seq_show(const struct bpf_map *map)
 {
-	return map->btf && map->ops->map_seq_show_elem;
+	return (map->btf_value_type_id || map->btf_vmlinux_value_type_id) &&
+		map->ops->map_seq_show_elem;
 }
 
 int map_check_no_btf(const struct bpf_map *map,
@@ -441,7 +444,8 @@ struct btf_func_model {
  * fentry = a set of program to run before calling original function
  * fexit = a set of program to run after original function
  */
-int arch_prepare_bpf_trampoline(void *image, struct btf_func_model *m, u32 flags,
+int arch_prepare_bpf_trampoline(void *image, void *image_end,
+				const struct btf_func_model *m, u32 flags,
 				struct bpf_prog **fentry_progs, int fentry_cnt,
 				struct bpf_prog **fexit_progs, int fexit_cnt,
 				void *orig_call);
@@ -672,6 +676,7 @@ struct bpf_array_aux {
 	struct work_struct work;
 };
 
+struct bpf_struct_ops_value;
 struct btf_type;
 struct btf_member;
 
@@ -681,21 +686,61 @@ struct bpf_struct_ops {
 	int (*init)(struct btf *btf);
 	int (*check_member)(const struct btf_type *t,
 			    const struct btf_member *member);
+	int (*init_member)(const struct btf_type *t,
+			   const struct btf_member *member,
+			   void *kdata, const void *udata);
+	int (*reg)(void *kdata);
+	void (*unreg)(void *kdata);
 	const struct btf_type *type;
+	const struct btf_type *value_type;
 	const char *name;
 	struct btf_func_model func_models[BPF_STRUCT_OPS_MAX_NR_MEMBERS];
 	u32 type_id;
+	u32 value_id;
 };
 
 #if defined(CONFIG_BPF_JIT) && defined(CONFIG_BPF_SYSCALL)
+#define BPF_MODULE_OWNER ((void *)((0xeB9FUL << 2) + POISON_POINTER_DELTA))
 const struct bpf_struct_ops *bpf_struct_ops_find(u32 type_id);
 void bpf_struct_ops_init(struct btf *btf);
+bool bpf_struct_ops_get(const void *kdata);
+void bpf_struct_ops_put(const void *kdata);
+int bpf_struct_ops_map_sys_lookup_elem(struct bpf_map *map, void *key,
+				       void *value);
+static inline bool bpf_try_module_get(const void *data, struct module *owner)
+{
+	if (owner == BPF_MODULE_OWNER)
+		return bpf_struct_ops_get(data);
+	else
+		return try_module_get(owner);
+}
+static inline void bpf_module_put(const void *data, struct module *owner)
+{
+	if (owner == BPF_MODULE_OWNER)
+		bpf_struct_ops_put(data);
+	else
+		module_put(owner);
+}
#else
 static inline const struct bpf_struct_ops *bpf_struct_ops_find(u32 type_id)
 {
 	return NULL;
 }
 static inline void bpf_struct_ops_init(struct btf *btf) { }
+static inline bool bpf_try_module_get(const void *data, struct module *owner)
+{
+	return try_module_get(owner);
+}
+static inline void bpf_module_put(const void *data, struct module *owner)
+{
+	module_put(owner);
+}
+static inline int bpf_struct_ops_map_sys_lookup_elem(struct bpf_map *map,
+						     void *key,
+						     void *value)
+{
+	return -EINVAL;
+}
 #endif
 
 struct bpf_array {
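[Editor's sketch] For context on how a subsystem would consume the helpers
added above: the following is hypothetical, not part of this commit. It
assumes the ops struct carries a "struct module *owner" member (as
struct tcp_congestion_ops does) and that the struct_ops infrastructure sets
that member to BPF_MODULE_OWNER for bpf-backed ops, so bpf_try_module_get()
routes to bpf_struct_ops_get() (pinning the struct_ops map) instead of
taking a module reference.

#include <linux/bpf.h>
#include <net/tcp.h>

/* Hypothetical helpers, for illustration only. */
static bool tcp_ca_hold(const struct tcp_congestion_ops *ca)
{
	/* Works for both module-backed and bpf-backed congestion ops. */
	return bpf_try_module_get(ca, ca->owner);
}

static void tcp_ca_release(const struct tcp_congestion_ops *ca)
{
	bpf_module_put(ca, ca->owner);
}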

Diff for: include/linux/bpf_types.h

+3
@@ -109,3 +109,6 @@ BPF_MAP_TYPE(BPF_MAP_TYPE_REUSEPORT_SOCKARRAY, reuseport_array_ops)
 #endif
 BPF_MAP_TYPE(BPF_MAP_TYPE_QUEUE, queue_map_ops)
 BPF_MAP_TYPE(BPF_MAP_TYPE_STACK, stack_map_ops)
+#if defined(CONFIG_BPF_JIT)
+BPF_MAP_TYPE(BPF_MAP_TYPE_STRUCT_OPS, bpf_struct_ops_map_ops)
+#endif

Diff for: include/linux/btf.h

+13
@@ -7,6 +7,8 @@
 #include <linux/types.h>
 #include <uapi/linux/btf.h>
 
+#define BTF_TYPE_EMIT(type) ((void)(type *)0)
+
 struct btf;
 struct btf_member;
 struct btf_type;
@@ -60,6 +62,10 @@ const struct btf_type *btf_type_resolve_ptr(const struct btf *btf,
 						 u32 id, u32 *res_id);
 const struct btf_type *btf_type_resolve_func_ptr(const struct btf *btf,
 						 u32 id, u32 *res_id);
+const struct btf_type *
+btf_resolve_size(const struct btf *btf, const struct btf_type *type,
+		 u32 *type_size, const struct btf_type **elem_type,
+		 u32 *total_nelems);
 
 #define for_each_member(i, struct_type, member)			\
 	for (i = 0, member = btf_type_member(struct_type);		\
@@ -106,6 +112,13 @@ static inline bool btf_type_kflag(const struct btf_type *t)
 	return BTF_INFO_KFLAG(t->info);
 }
 
+static inline u32 btf_member_bit_offset(const struct btf_type *struct_type,
+					const struct btf_member *member)
+{
+	return btf_type_kflag(struct_type) ? BTF_MEMBER_BIT_OFFSET(member->offset)
+					   : member->offset;
+}
+
 static inline u32 btf_member_bitfield_size(const struct btf_type *struct_type,
 					   const struct btf_member *member)
 {
Diff for: include/uapi/linux/bpf.h

+6-1
@@ -136,6 +136,7 @@ enum bpf_map_type {
 	BPF_MAP_TYPE_STACK,
 	BPF_MAP_TYPE_SK_STORAGE,
 	BPF_MAP_TYPE_DEVMAP_HASH,
+	BPF_MAP_TYPE_STRUCT_OPS,
 };
 
 /* Note that tracing related programs such as
@@ -398,6 +399,10 @@ union bpf_attr {
 		__u32	btf_fd;		/* fd pointing to a BTF type data */
 		__u32	btf_key_type_id;	/* BTF type_id of the key */
 		__u32	btf_value_type_id;	/* BTF type_id of the value */
+		__u32	btf_vmlinux_value_type_id;/* BTF type_id of a kernel-
+						   * struct stored as the
+						   * map value
+						   */
 	};
 
 	struct { /* anonymous struct used by BPF_MAP_*_ELEM commands */
@@ -3350,7 +3355,7 @@ struct bpf_map_info {
 	__u32 map_flags;
 	char  name[BPF_OBJ_NAME_LEN];
 	__u32 ifindex;
-	__u32  :32;
+	__u32 btf_vmlinux_value_type_id;
 	__u64 netns_dev;
 	__u64 netns_ino;
 	__u32 btf_id;
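[Editor's sketch] Since struct bpf_map_info now carries
btf_vmlinux_value_type_id, userspace can read it back with
BPF_OBJ_GET_INFO_BY_FD. A hedged sketch, reusing the hypothetical sys_bpf()
helper and includes from the registration sketch earlier on this page;
get_vmlinux_value_type_id() is an illustrative name:

static int get_vmlinux_value_type_id(int map_fd, __u32 *type_id)
{
	struct bpf_map_info info;
	union bpf_attr attr;
	int err;

	memset(&info, 0, sizeof(info));
	memset(&attr, 0, sizeof(attr));
	attr.info.bpf_fd = map_fd;
	attr.info.info_len = sizeof(info);
	attr.info.info = (__u64)(unsigned long)&info;

	/* The kernel fills struct bpf_map_info, including the new
	 * btf_vmlinux_value_type_id field shown in the diff above.
	 */
	err = sys_bpf(BPF_OBJ_GET_INFO_BY_FD, &attr, sizeof(attr));
	if (!err)
		*type_id = info.btf_vmlinux_value_type_id;
	return err;
}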
