Commit fec56f5

Alexei Starovoitov authored, Daniel Borkmann committed
bpf: Introduce BPF trampoline
Introduce the BPF trampoline concept to allow kernel code to call into BPF programs with practically zero overhead. The trampoline generation logic is architecture dependent: it converts the native calling convention into the BPF calling convention. The BPF ISA is 64-bit (even on 32-bit architectures). Registers R1 to R5 are used to pass arguments into BPF functions, and the main BPF program accepts only a single "ctx" argument in R1. Native CPU calling conventions differ: x86-64 passes the first 6 arguments in registers and the rest on the stack; x86-32 passes the first 3 arguments in registers; sparc64 passes the first 6 in registers; and so on.

Trampolines between BPF and the kernel already exist. The BPF_CALL_x macros in include/linux/filter.h statically compile trampolines from BPF into kernel helpers; they convert up to five u64 arguments into kernel C pointers and integers. On 64-bit architectures these BPF_to_kernel trampolines are nops; on 32-bit architectures they are meaningful. The opposite job, kernel_to_BPF trampolines, is done by the CAST_TO_U64 macros and the __bpf_trace_##call() shim functions in include/trace/bpf_probe.h, which convert kernel function arguments into an array of u64s that a BPF program consumes via the R1=ctx pointer.

This patch set does the same job as the __bpf_trace_##call() static trampolines, but dynamically for any kernel function. There are ~22k global kernel functions that are attachable via a nop at function entry. The function arguments and types are described in BTF. The job of the btf_distill_func_proto() function is to extract the useful information from BTF into a "function model" that architecture-dependent trampoline generators use to generate assembly code casting kernel function arguments into an array of u64s. For example, the kernel function eth_type_trans has two pointer arguments; they will be cast to u64 and stored on the stack of the generated trampoline, and the pointer to that stack space will be passed into the BPF program in R1.
On x86-64 such a generated trampoline consumes 16 bytes of stack and two stores of %rdi and %rsi into that stack. The verifier makes sure that only those two u64s are accessed, read-only, by the BPF program. The verifier also recognizes the precise type of the pointers being accessed and does not allow typecasting a pointer to a different type within the BPF program.

The tracing use case in the datacenter demonstrated that certain key kernel functions (like tcp_retransmit_skb) have 2 or more kprobes that are always active, and other functions have both a kprobe and a kretprobe, so it is essential to keep both kernel code and BPF programs executing at maximum speed. Hence the generated BPF trampoline is re-generated every time a new program is attached or detached, to maintain maximum performance.

To avoid the high cost of retpoline, the attached BPF programs are called directly. __bpf_prog_enter/exit() are used to support per-program execution stats. In the future this logic will be optimized further by adding support for bpf_stats_enabled_key inside the generated assembly code. The introduction of preemptible and sleepable BPF programs will completely remove the need to call __bpf_prog_enter/exit().

Detaching a BPF program from the trampoline should not fail. To avoid memory allocation in the detach path, half of the page is used as a reserve and flipped after each attach/detach. 2k bytes is enough to call 40+ BPF programs directly, which is enough for BPF tracing use cases; this limit can be increased in the future.

BPF_TRACE_FENTRY programs have access to the raw kernel function arguments, while BPF_TRACE_FEXIT programs have access to the kernel return value as well. Often a kprobe BPF program remembers function arguments in a map while a kretprobe fetches the arguments from the map and analyzes them together with the return value; BPF_TRACE_FEXIT accelerates this typical use case. Recursion prevention for kprobe BPF programs is done via a per-cpu bpf_prog_active counter.
In practice that turned out to be a mistake: it caused programs to randomly skip execution, and the tracing tools missed the results they were looking for. Hence the BPF trampoline doesn't provide built-in recursion prevention; that is the job of the BPF program itself and will be addressed in follow-up patches.

The BPF trampoline is intended to be used beyond tracing and fentry/fexit use cases in the future, for example to remove retpoline cost from XDP programs.

Signed-off-by: Alexei Starovoitov <[email protected]>
Signed-off-by: Daniel Borkmann <[email protected]>
Acked-by: Andrii Nakryiko <[email protected]>
Acked-by: Song Liu <[email protected]>
Link: https://lore.kernel.org/bpf/[email protected]
1 parent: 5964b20

File tree: 9 files changed, +735 −10 lines


Diff for: arch/x86/net/bpf_jit_comp.c (+209, −2)

@@ -98,6 +98,7 @@ static int bpf_size_to_x86_bytes(int bpf_size)
 
 /* Pick a register outside of BPF range for JIT internal work */
 #define AUX_REG (MAX_BPF_JIT_REG + 1)
+#define X86_REG_R9 (MAX_BPF_JIT_REG + 2)
 
 /*
  * The following table maps BPF registers to x86-64 registers.
@@ -106,8 +107,8 @@ static int bpf_size_to_x86_bytes(int bpf_size)
  * register in load/store instructions, it always needs an
  * extra byte of encoding and is callee saved.
  *
- * Also x86-64 register R9 is unused. x86-64 register R10 is
- * used for blinding (if enabled).
+ * x86-64 register R9 is not used by BPF programs, but can be used by BPF
+ * trampoline. x86-64 register R10 is used for blinding (if enabled).
  */
 static const int reg2hex[] = {
 	[BPF_REG_0] = 0,  /* RAX */
@@ -123,6 +124,7 @@ static const int reg2hex[] = {
 	[BPF_REG_FP] = 5, /* RBP readonly */
 	[BPF_REG_AX] = 2, /* R10 temp register */
 	[AUX_REG] = 3,    /* R11 temp register */
+	[X86_REG_R9] = 1, /* R9 register, 6th function argument */
 };
 
 static const int reg2pt_regs[] = {
@@ -150,6 +152,7 @@ static bool is_ereg(u32 reg)
 			     BIT(BPF_REG_7) |
 			     BIT(BPF_REG_8) |
 			     BIT(BPF_REG_9) |
+			     BIT(X86_REG_R9) |
 			     BIT(BPF_REG_AX));
 }
 
@@ -1233,6 +1236,210 @@ xadd:			if (is_imm8(insn->off))
 	return proglen;
 }
 
+static void save_regs(struct btf_func_model *m, u8 **prog, int nr_args,
+		      int stack_size)
+{
+	int i;
+
+	/* Store function arguments to stack.
+	 * For a function that accepts two pointers the sequence will be:
+	 * mov QWORD PTR [rbp-0x10],rdi
+	 * mov QWORD PTR [rbp-0x8],rsi
+	 */
+	for (i = 0; i < min(nr_args, 6); i++)
+		emit_stx(prog, bytes_to_bpf_size(m->arg_size[i]),
+			 BPF_REG_FP,
+			 i == 5 ? X86_REG_R9 : BPF_REG_1 + i,
+			 -(stack_size - i * 8));
+}
+
+static void restore_regs(struct btf_func_model *m, u8 **prog, int nr_args,
+			 int stack_size)
+{
+	int i;
+
+	/* Restore function arguments from stack.
+	 * For a function that accepts two pointers the sequence will be:
+	 * EMIT4(0x48, 0x8B, 0x7D, 0xF0); mov rdi,QWORD PTR [rbp-0x10]
+	 * EMIT4(0x48, 0x8B, 0x75, 0xF8); mov rsi,QWORD PTR [rbp-0x8]
+	 */
+	for (i = 0; i < min(nr_args, 6); i++)
+		emit_ldx(prog, bytes_to_bpf_size(m->arg_size[i]),
+			 i == 5 ? X86_REG_R9 : BPF_REG_1 + i,
+			 BPF_REG_FP,
+			 -(stack_size - i * 8));
+}
+
+static int invoke_bpf(struct btf_func_model *m, u8 **pprog,
+		      struct bpf_prog **progs, int prog_cnt, int stack_size)
+{
+	u8 *prog = *pprog;
+	int cnt = 0, i;
+
+	for (i = 0; i < prog_cnt; i++) {
+		if (emit_call(&prog, __bpf_prog_enter, prog))
+			return -EINVAL;
+		/* remember prog start time returned by __bpf_prog_enter */
+		emit_mov_reg(&prog, true, BPF_REG_6, BPF_REG_0);
+
+		/* arg1: lea rdi, [rbp - stack_size] */
+		EMIT4(0x48, 0x8D, 0x7D, -stack_size);
+		/* arg2: progs[i]->insnsi for interpreter */
+		if (!progs[i]->jited)
+			emit_mov_imm64(&prog, BPF_REG_2,
+				       (long) progs[i]->insnsi >> 32,
+				       (u32) (long) progs[i]->insnsi);
+		/* call JITed bpf program or interpreter */
+		if (emit_call(&prog, progs[i]->bpf_func, prog))
+			return -EINVAL;
+
+		/* arg1: mov rdi, progs[i] */
+		emit_mov_imm64(&prog, BPF_REG_1, (long) progs[i] >> 32,
+			       (u32) (long) progs[i]);
+		/* arg2: mov rsi, rbx <- start time in nsec */
+		emit_mov_reg(&prog, true, BPF_REG_2, BPF_REG_6);
+		if (emit_call(&prog, __bpf_prog_exit, prog))
+			return -EINVAL;
+	}
+	*pprog = prog;
+	return 0;
+}
+
+/* Example:
+ * __be16 eth_type_trans(struct sk_buff *skb, struct net_device *dev);
+ * its 'struct btf_func_model' will be nr_args=2
+ * The assembly code when eth_type_trans is executing after trampoline:
+ *
+ * push rbp
+ * mov rbp, rsp
+ * sub rsp, 16                     // space for skb and dev
+ * push rbx                        // temp regs to pass start time
+ * mov qword ptr [rbp - 16], rdi   // save skb pointer to stack
+ * mov qword ptr [rbp - 8], rsi    // save dev pointer to stack
+ * call __bpf_prog_enter           // rcu_read_lock and preempt_disable
+ * mov rbx, rax                    // remember start time if bpf stats are enabled
+ * lea rdi, [rbp - 16]             // R1==ctx of bpf prog
+ * call addr_of_jited_FENTRY_prog
+ * movabsq rdi, 64bit_addr_of_struct_bpf_prog  // unused if bpf stats are off
+ * mov rsi, rbx                    // prog start time
+ * call __bpf_prog_exit            // rcu_read_unlock, preempt_enable and stats math
+ * mov rdi, qword ptr [rbp - 16]   // restore skb pointer from stack
+ * mov rsi, qword ptr [rbp - 8]    // restore dev pointer from stack
+ * pop rbx
+ * leave
+ * ret
+ *
+ * eth_type_trans has 5 byte nop at the beginning. These 5 bytes will be
+ * replaced with 'call generated_bpf_trampoline'. When it returns
+ * eth_type_trans will continue executing with original skb and dev pointers.
+ *
+ * The assembly code when eth_type_trans is called from trampoline:
+ *
+ * push rbp
+ * mov rbp, rsp
+ * sub rsp, 24                     // space for skb, dev, return value
+ * push rbx                        // temp regs to pass start time
+ * mov qword ptr [rbp - 24], rdi   // save skb pointer to stack
+ * mov qword ptr [rbp - 16], rsi   // save dev pointer to stack
+ * call __bpf_prog_enter           // rcu_read_lock and preempt_disable
+ * mov rbx, rax                    // remember start time if bpf stats are enabled
+ * lea rdi, [rbp - 24]             // R1==ctx of bpf prog
+ * call addr_of_jited_FENTRY_prog  // bpf prog can access skb and dev
+ * movabsq rdi, 64bit_addr_of_struct_bpf_prog  // unused if bpf stats are off
+ * mov rsi, rbx                    // prog start time
+ * call __bpf_prog_exit            // rcu_read_unlock, preempt_enable and stats math
+ * mov rdi, qword ptr [rbp - 24]   // restore skb pointer from stack
+ * mov rsi, qword ptr [rbp - 16]   // restore dev pointer from stack
+ * call eth_type_trans+5           // execute body of eth_type_trans
+ * mov qword ptr [rbp - 8], rax    // save return value
+ * call __bpf_prog_enter           // rcu_read_lock and preempt_disable
+ * mov rbx, rax                    // remember start time if bpf stats are enabled
+ * lea rdi, [rbp - 24]             // R1==ctx of bpf prog
+ * call addr_of_jited_FEXIT_prog   // bpf prog can access skb, dev, return value
+ * movabsq rdi, 64bit_addr_of_struct_bpf_prog  // unused if bpf stats are off
+ * mov rsi, rbx                    // prog start time
+ * call __bpf_prog_exit            // rcu_read_unlock, preempt_enable and stats math
+ * mov rax, qword ptr [rbp - 8]    // restore eth_type_trans's return value
+ * pop rbx
+ * leave
+ * add rsp, 8                      // skip eth_type_trans's frame
+ * ret                             // return to its caller
+ */
+int arch_prepare_bpf_trampoline(void *image, struct btf_func_model *m, u32 flags,
+				struct bpf_prog **fentry_progs, int fentry_cnt,
+				struct bpf_prog **fexit_progs, int fexit_cnt,
+				void *orig_call)
+{
+	int cnt = 0, nr_args = m->nr_args;
+	int stack_size = nr_args * 8;
+	u8 *prog;
+
+	/* x86-64 supports up to 6 arguments. 7+ can be added in the future */
+	if (nr_args > 6)
+		return -ENOTSUPP;
+
+	if ((flags & BPF_TRAMP_F_RESTORE_REGS) &&
+	    (flags & BPF_TRAMP_F_SKIP_FRAME))
+		return -EINVAL;
+
+	if (flags & BPF_TRAMP_F_CALL_ORIG)
+		stack_size += 8; /* room for return value of orig_call */
+
+	if (flags & BPF_TRAMP_F_SKIP_FRAME)
+		/* skip patched call instruction and point orig_call to actual
+		 * body of the kernel function.
+		 */
+		orig_call += X86_CALL_SIZE;
+
+	prog = image;
+
+	EMIT1(0x55);		 /* push rbp */
+	EMIT3(0x48, 0x89, 0xE5); /* mov rbp, rsp */
+	EMIT4(0x48, 0x83, 0xEC, stack_size); /* sub rsp, stack_size */
+	EMIT1(0x53);		 /* push rbx */
+
+	save_regs(m, &prog, nr_args, stack_size);
+
+	if (fentry_cnt)
+		if (invoke_bpf(m, &prog, fentry_progs, fentry_cnt, stack_size))
+			return -EINVAL;
+
+	if (flags & BPF_TRAMP_F_CALL_ORIG) {
+		if (fentry_cnt)
+			restore_regs(m, &prog, nr_args, stack_size);
+
+		/* call original function */
+		if (emit_call(&prog, orig_call, prog))
+			return -EINVAL;
+		/* remember return value in a stack for bpf prog to access */
+		emit_stx(&prog, BPF_DW, BPF_REG_FP, BPF_REG_0, -8);
+	}
+
+	if (fexit_cnt)
+		if (invoke_bpf(m, &prog, fexit_progs, fexit_cnt, stack_size))
+			return -EINVAL;
+
+	if (flags & BPF_TRAMP_F_RESTORE_REGS)
+		restore_regs(m, &prog, nr_args, stack_size);
+
+	if (flags & BPF_TRAMP_F_CALL_ORIG)
+		/* restore original return value back into RAX */
+		emit_ldx(&prog, BPF_DW, BPF_REG_0, BPF_REG_FP, -8);
+
+	EMIT1(0x5B); /* pop rbx */
+	EMIT1(0xC9); /* leave */
+	if (flags & BPF_TRAMP_F_SKIP_FRAME)
+		/* skip our return address and return to parent */
+		EMIT4(0x48, 0x83, 0xC4, 8); /* add rsp, 8 */
+	EMIT1(0xC3); /* ret */
+	/* One half of the page has active running trampoline.
+	 * Another half is an area for next trampoline.
+	 * Make sure the trampoline generation logic doesn't overflow.
+	 */
+	if (WARN_ON_ONCE(prog - (u8 *)image > PAGE_SIZE / 2 - BPF_INSN_SAFETY))
+		return -EFAULT;
+	return 0;
+}
+
 struct x64_jit_data {
 	struct bpf_binary_header *header;
 	int *addrs;

Diff for: include/linux/bpf.h (+105)

@@ -14,6 +14,8 @@
 #include <linux/numa.h>
 #include <linux/wait.h>
 #include <linux/u64_stats_sync.h>
+#include <linux/refcount.h>
+#include <linux/mutex.h>
 
 struct bpf_verifier_env;
 struct bpf_verifier_log;
@@ -384,6 +386,100 @@ struct bpf_prog_stats {
 	struct u64_stats_sync syncp;
 } __aligned(2 * sizeof(u64));
 
+struct btf_func_model {
+	u8 ret_size;
+	u8 nr_args;
+	u8 arg_size[MAX_BPF_FUNC_ARGS];
+};
+
+/* Restore arguments before returning from trampoline to let original function
+ * continue executing. This flag is used for fentry progs when there are no
+ * fexit progs.
+ */
+#define BPF_TRAMP_F_RESTORE_REGS	BIT(0)
+/* Call original function after fentry progs, but before fexit progs.
+ * Makes sense for fentry/fexit, normal calls and indirect calls.
+ */
+#define BPF_TRAMP_F_CALL_ORIG		BIT(1)
+/* Skip current frame and return to parent. Makes sense for fentry/fexit
+ * programs only. Should not be used with normal calls and indirect calls.
+ */
+#define BPF_TRAMP_F_SKIP_FRAME		BIT(2)
+
+/* Different use cases for BPF trampoline:
+ * 1. replace nop at the function entry (kprobe equivalent)
+ *    flags = BPF_TRAMP_F_RESTORE_REGS
+ *    fentry = a set of programs to run before returning from trampoline
+ *
+ * 2. replace nop at the function entry (kprobe + kretprobe equivalent)
+ *    flags = BPF_TRAMP_F_CALL_ORIG | BPF_TRAMP_F_SKIP_FRAME
+ *    orig_call = fentry_ip + MCOUNT_INSN_SIZE
+ *    fentry = a set of programs to run before calling original function
+ *    fexit = a set of programs to run after original function
+ *
+ * 3. replace direct call instruction anywhere in the function body
+ *    or assign a function pointer for indirect call (like tcp_congestion_ops->cong_avoid)
+ *    With flags = 0
+ *      fentry = a set of programs to run before returning from trampoline
+ *    With flags = BPF_TRAMP_F_CALL_ORIG
+ *      orig_call = original callback addr or direct function addr
+ *      fentry = a set of programs to run before calling original function
+ *      fexit = a set of programs to run after original function
+ */
+int arch_prepare_bpf_trampoline(void *image, struct btf_func_model *m, u32 flags,
+				struct bpf_prog **fentry_progs, int fentry_cnt,
+				struct bpf_prog **fexit_progs, int fexit_cnt,
+				void *orig_call);
+/* these two functions are called from generated trampoline */
+u64 notrace __bpf_prog_enter(void);
+void notrace __bpf_prog_exit(struct bpf_prog *prog, u64 start);
+
+enum bpf_tramp_prog_type {
+	BPF_TRAMP_FENTRY,
+	BPF_TRAMP_FEXIT,
+	BPF_TRAMP_MAX
+};
+
+struct bpf_trampoline {
+	/* hlist for trampoline_table */
+	struct hlist_node hlist;
+	/* serializes access to fields of this trampoline */
+	struct mutex mutex;
+	refcount_t refcnt;
+	u64 key;
+	struct {
+		struct btf_func_model model;
+		void *addr;
+	} func;
+	/* list of BPF programs using this trampoline */
+	struct hlist_head progs_hlist[BPF_TRAMP_MAX];
+	/* Number of attached programs. A counter per kind. */
+	int progs_cnt[BPF_TRAMP_MAX];
+	/* Executable image of trampoline */
+	void *image;
+	u64 selector;
+};
+#ifdef CONFIG_BPF_JIT
+struct bpf_trampoline *bpf_trampoline_lookup(u64 key);
+int bpf_trampoline_link_prog(struct bpf_prog *prog);
+int bpf_trampoline_unlink_prog(struct bpf_prog *prog);
+void bpf_trampoline_put(struct bpf_trampoline *tr);
+#else
+static inline struct bpf_trampoline *bpf_trampoline_lookup(u64 key)
+{
+	return NULL;
+}
+static inline int bpf_trampoline_link_prog(struct bpf_prog *prog)
+{
+	return -ENOTSUPP;
+}
+static inline int bpf_trampoline_unlink_prog(struct bpf_prog *prog)
+{
+	return -ENOTSUPP;
+}
+static inline void bpf_trampoline_put(struct bpf_trampoline *tr) {}
+#endif
+
 struct bpf_prog_aux {
 	atomic_t refcnt;
 	u32 used_map_cnt;
@@ -398,6 +494,9 @@ struct bpf_prog_aux {
 	bool verifier_zext; /* Zero extensions has been inserted by verifier. */
 	bool offload_requested;
 	bool attach_btf_trace; /* true if attaching to BTF-enabled raw tp */
+	enum bpf_tramp_prog_type trampoline_prog_type;
+	struct bpf_trampoline *trampoline;
+	struct hlist_node tramp_hlist;
 	/* BTF_KIND_FUNC_PROTO for valid attach_btf_id */
 	const struct btf_type *attach_func_proto;
 	/* function name for valid attach_btf_id */
@@ -784,6 +883,12 @@ int btf_struct_access(struct bpf_verifier_log *log,
 		      u32 *next_btf_id);
 u32 btf_resolve_helper_id(struct bpf_verifier_log *log, void *, int);
 
+int btf_distill_func_proto(struct bpf_verifier_log *log,
+			   struct btf *btf,
+			   const struct btf_type *func_proto,
+			   const char *func_name,
+			   struct btf_func_model *m);
+
 #else /* !CONFIG_BPF_SYSCALL */
 static inline struct bpf_prog *bpf_prog_get(u32 ufd)
 {

Diff for: include/uapi/linux/bpf.h (+2)

@@ -201,6 +201,8 @@ enum bpf_attach_type {
 	BPF_CGROUP_GETSOCKOPT,
 	BPF_CGROUP_SETSOCKOPT,
 	BPF_TRACE_RAW_TP,
+	BPF_TRACE_FENTRY,
+	BPF_TRACE_FEXIT,
 	__MAX_BPF_ATTACH_TYPE
 };

Diff for: kernel/bpf/Makefile (+1)

@@ -6,6 +6,7 @@ obj-$(CONFIG_BPF_SYSCALL) += syscall.o verifier.o inode.o helpers.o tnum.o
 obj-$(CONFIG_BPF_SYSCALL) += hashtab.o arraymap.o percpu_freelist.o bpf_lru_list.o lpm_trie.o map_in_map.o
 obj-$(CONFIG_BPF_SYSCALL) += local_storage.o queue_stack_maps.o
 obj-$(CONFIG_BPF_SYSCALL) += disasm.o
+obj-$(CONFIG_BPF_JIT) += trampoline.o
 obj-$(CONFIG_BPF_SYSCALL) += btf.o
 ifeq ($(CONFIG_NET),y)
 obj-$(CONFIG_BPF_SYSCALL) += devmap.o
