Skip to content

Commit 6ac99e8

Browse files
iamkafaiAlexei Starovoitov
authored and
Alexei Starovoitov
committed
bpf: Introduce bpf sk local storage
After allowing a bpf prog to - directly read the skb->sk ptr - get the fullsock bpf_sock by "bpf_sk_fullsock()" - get the bpf_tcp_sock by "bpf_tcp_sock()" - get the listener sock by "bpf_get_listener_sock()" - avoid duplicating the fields of "(bpf_)sock" and "(bpf_)tcp_sock" into different bpf running context. this patch is another effort to make bpf's network programming more intuitive to do (together with memory and performance benefit). When bpf prog needs to store data for a sk, the current practice is to define a map with the usual 4-tuples (src/dst ip/port) as the key. If multiple bpf progs require to store different sk data, multiple maps have to be defined. Hence, wasting memory to store the duplicated keys (i.e. 4 tuples here) in each of the bpf map. [ The smallest key could be the sk pointer itself which requires some enhancement in the verifier and it is a separate topic. ] Also, the bpf prog needs to clean up the elem when sk is freed. Otherwise, the bpf map will become full and un-usable quickly. The sk-free tracking currently could be done during sk state transition (e.g. BPF_SOCK_OPS_STATE_CB). The size of the map needs to be predefined which then usually ended-up with an over-provisioned map in production. Even the map was re-sizable, while the sk naturally come and go away already, this potential re-size operation is arguably redundant if the data can be directly connected to the sk itself instead of proxy-ing through a bpf map. This patch introduces sk->sk_bpf_storage to provide local storage space at sk for bpf prog to use. The space will be allocated when the first bpf prog has created data for this particular sk. The design optimizes the bpf prog's lookup (and then optionally followed by an inline update). bpf_spin_lock should be used if the inline update needs to be protected. BPF_MAP_TYPE_SK_STORAGE: ----------------------- To define a bpf "sk-local-storage", a BPF_MAP_TYPE_SK_STORAGE map (new in this patch) needs to be created. Multiple BPF_MAP_TYPE_SK_STORAGE maps can be created to fit different bpf progs' needs. The map enforces BTF to allow printing the sk-local-storage during a system-wise sk dump (e.g. "ss -ta") in the future. The purpose of a BPF_MAP_TYPE_SK_STORAGE map is not for lookup/update/delete a "sk-local-storage" data from a particular sk. Think of the map as a meta-data (or "type") of a "sk-local-storage". This particular "type" of "sk-local-storage" data can then be stored in any sk. The main purposes of this map are mostly: 1. Define the size of a "sk-local-storage" type. 2. Provide a similar syscall userspace API as the map (e.g. lookup/update, map-id, map-btf...etc.) 3. Keep track of all sk's storages of this "type" and clean them up when the map is freed. sk->sk_bpf_storage: ------------------ The main lookup/update/delete is done on sk->sk_bpf_storage (which is a "struct bpf_sk_storage"). When doing a lookup, the "map" pointer is now used as the "key" to search on the sk_storage->list. The "map" pointer is actually serving as the "type" of the "sk-local-storage" that is being requested. To allow very fast lookup, it should be as fast as looking up an array at a stable-offset. At the same time, it is not ideal to set a hard limit on the number of sk-local-storage "type" that the system can have. Hence, this patch takes a cache approach. The last search result from sk_storage->list is cached in sk_storage->cache[] which is a stable sized array. Each "sk-local-storage" type has a stable offset to the cache[] array. In the future, a map's flag could be introduced to do cache opt-out/enforcement if it became necessary. The cache size is 16 (i.e. 16 types of "sk-local-storage"). Programs can share map. On the program side, having a few bpf_progs running in the networking hotpath is already a lot. The bpf_prog should have already consolidated the existing sock-key-ed map usage to minimize the map lookup penalty. 16 has enough runway to grow. All sk-local-storage data will be removed from sk->sk_bpf_storage during sk destruction. bpf_sk_storage_get() and bpf_sk_storage_delete(): ------------------------------------------------ Instead of using bpf_map_(lookup|update|delete)_elem(), the bpf prog needs to use the new helper bpf_sk_storage_get() and bpf_sk_storage_delete(). The verifier can then enforce the ARG_PTR_TO_SOCKET argument. The bpf_sk_storage_get() also allows to "create" new elem if one does not exist in the sk. It is done by the new BPF_SK_STORAGE_GET_F_CREATE flag. An optional value can also be provided as the initial value during BPF_SK_STORAGE_GET_F_CREATE. The BPF_MAP_TYPE_SK_STORAGE also supports bpf_spin_lock. Together, it has eliminated the potential use cases for an equivalent bpf_map_update_elem() API (for bpf_prog) in this patch. Misc notes: ---------- 1. map_get_next_key is not supported. From the userspace syscall perspective, the map has the socket fd as the key while the map can be shared by pinned-file or map-id. Since btf is enforced, the existing "ss" could be enhanced to pretty print the local-storage. Supporting a kernel defined btf with 4 tuples as the return key could be explored later also. 2. The sk->sk_lock cannot be acquired. Atomic operations is used instead. e.g. cmpxchg is done on the sk->sk_bpf_storage ptr. Please refer to the source code comments for the details in synchronization cases and considerations. 3. The mem is charged to the sk->sk_omem_alloc as the sk filter does. Benchmark: --------- Here is the benchmark data collected by turning on the "kernel.bpf_stats_enabled" sysctl. Two bpf progs are tested: One bpf prog with the usual bpf hashmap (max_entries = 8192) with the sk ptr as the key. (verifier is modified to support sk ptr as the key That should have shortened the key lookup time.) Another bpf prog is with the new BPF_MAP_TYPE_SK_STORAGE. Both are storing a "u32 cnt", do a lookup on "egress_skb/cgroup" for each egress skb and then bump the cnt. netperf is used to drive data with 4096 connected UDP sockets. BPF_MAP_TYPE_HASH with a modifier verifier (152ns per bpf run) 27: cgroup_skb name egress_sk_map tag 74f56e832918070b run_time_ns 58280107540 run_cnt 381347633 loaded_at 2019-04-15T13:46:39-0700 uid 0 xlated 344B jited 258B memlock 4096B map_ids 16 btf_id 5 BPF_MAP_TYPE_SK_STORAGE in this patch (66ns per bpf run) 30: cgroup_skb name egress_sk_stora tag d4aa70984cc7bbf6 run_time_ns 25617093319 run_cnt 390989739 loaded_at 2019-04-15T13:47:54-0700 uid 0 xlated 168B jited 156B memlock 4096B map_ids 17 btf_id 6 Here is a high-level picture on how are the objects organized: sk ┌──────┐ │ │ │ │ │ │ │*sk_bpf_storage─────▶ bpf_sk_storage └──────┘ ┌───────┐ ┌───────────┤ list │ │ │ │ │ │ │ │ │ │ │ └───────┘ │ │ elem │ ┌────────┐ ├─▶│ snode │ │ ├────────┤ │ │ data │ bpf_map │ ├────────┤ ┌─────────┐ │ │map_node│◀─┬─────┤ list │ │ └────────┘ │ │ │ │ │ │ │ │ elem │ │ │ │ ┌────────┐ │ └─────────┘ └─▶│ snode │ │ ├────────┤ │ bpf_map │ data │ │ ┌─────────┐ ├────────┤ │ │ list ├───────▶│map_node│ │ │ │ └────────┘ │ │ │ │ │ │ elem │ └─────────┘ ┌────────┐ │ ┌─▶│ snode │ │ │ ├────────┤ │ │ │ data │ │ │ ├────────┤ │ │ │map_node│◀─┘ │ └────────┘ │ │ │ ┌───────┐ sk └──────────│ list │ ┌──────┐ │ │ │ │ │ │ │ │ │ │ │ │ └───────┘ │*sk_bpf_storage───────▶bpf_sk_storage └──────┘ Signed-off-by: Martin KaFai Lau <[email protected]> Signed-off-by: Alexei Starovoitov <[email protected]>
1 parent 3745dc2 commit 6ac99e8

File tree

12 files changed

+914
-5
lines changed

12 files changed

+914
-5
lines changed

Diff for: include/linux/bpf.h

+2
Original file line numberDiff line numberDiff line change
@@ -184,6 +184,7 @@ enum bpf_arg_type {
184184
ARG_PTR_TO_MAP_KEY, /* pointer to stack used as map key */
185185
ARG_PTR_TO_MAP_VALUE, /* pointer to stack used as map value */
186186
ARG_PTR_TO_UNINIT_MAP_VALUE, /* pointer to valid memory used to store a map value */
187+
ARG_PTR_TO_MAP_VALUE_OR_NULL, /* pointer to stack used as map value or NULL */
187188

188189
/* the following constraints used to prototype bpf_memcmp() and other
189190
* functions that access data on eBPF program stack
@@ -204,6 +205,7 @@ enum bpf_arg_type {
204205
ARG_PTR_TO_SOCK_COMMON, /* pointer to sock_common */
205206
ARG_PTR_TO_INT, /* pointer to int */
206207
ARG_PTR_TO_LONG, /* pointer to long */
208+
ARG_PTR_TO_SOCKET, /* pointer to bpf_sock (fullsock) */
207209
};
208210

209211
/* type of values returned from helper functions */

Diff for: include/linux/bpf_types.h

+1
Original file line numberDiff line numberDiff line change
@@ -61,6 +61,7 @@ BPF_MAP_TYPE(BPF_MAP_TYPE_ARRAY_OF_MAPS, array_of_maps_map_ops)
6161
BPF_MAP_TYPE(BPF_MAP_TYPE_HASH_OF_MAPS, htab_of_maps_map_ops)
6262
#ifdef CONFIG_NET
6363
BPF_MAP_TYPE(BPF_MAP_TYPE_DEVMAP, dev_map_ops)
64+
BPF_MAP_TYPE(BPF_MAP_TYPE_SK_STORAGE, sk_storage_map_ops)
6465
#if defined(CONFIG_BPF_STREAM_PARSER)
6566
BPF_MAP_TYPE(BPF_MAP_TYPE_SOCKMAP, sock_map_ops)
6667
BPF_MAP_TYPE(BPF_MAP_TYPE_SOCKHASH, sock_hash_ops)

Diff for: include/net/bpf_sk_storage.h

+13
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,13 @@
1+
/* SPDX-License-Identifier: GPL-2.0 */
2+
/* Copyright (c) 2019 Facebook */
3+
#ifndef _BPF_SK_STORAGE_H
4+
#define _BPF_SK_STORAGE_H
5+
6+
struct sock;
7+
8+
void bpf_sk_storage_free(struct sock *sk);
9+
10+
extern const struct bpf_func_proto bpf_sk_storage_get_proto;
11+
extern const struct bpf_func_proto bpf_sk_storage_delete_proto;
12+
13+
#endif /* _BPF_SK_STORAGE_H */

Diff for: include/net/sock.h

+5
Original file line numberDiff line numberDiff line change
@@ -236,6 +236,8 @@ struct sock_common {
236236
/* public: */
237237
};
238238

239+
struct bpf_sk_storage;
240+
239241
/**
240242
* struct sock - network layer representation of sockets
241243
* @__sk_common: shared layout with inet_timewait_sock
@@ -510,6 +512,9 @@ struct sock {
510512
#endif
511513
void (*sk_destruct)(struct sock *sk);
512514
struct sock_reuseport __rcu *sk_reuseport_cb;
515+
#ifdef CONFIG_BPF_SYSCALL
516+
struct bpf_sk_storage __rcu *sk_bpf_storage;
517+
#endif
513518
struct rcu_head sk_rcu;
514519
};
515520

Diff for: include/uapi/linux/bpf.h

+43-1
Original file line numberDiff line numberDiff line change
@@ -133,6 +133,7 @@ enum bpf_map_type {
133133
BPF_MAP_TYPE_PERCPU_CGROUP_STORAGE,
134134
BPF_MAP_TYPE_QUEUE,
135135
BPF_MAP_TYPE_STACK,
136+
BPF_MAP_TYPE_SK_STORAGE,
136137
};
137138

138139
/* Note that tracing related programs such as
@@ -2630,6 +2631,42 @@ union bpf_attr {
26302631
* was provided.
26312632
*
26322633
* **-ERANGE** if resulting value was out of range.
2634+
*
2635+
* void *bpf_sk_storage_get(struct bpf_map *map, struct bpf_sock *sk, void *value, u64 flags)
2636+
* Description
2637+
* Get a bpf-local-storage from a sk.
2638+
*
2639+
* Logically, it could be thought of getting the value from
2640+
* a *map* with *sk* as the **key**. From this
2641+
* perspective, the usage is not much different from
2642+
* **bpf_map_lookup_elem(map, &sk)** except this
2643+
* helper enforces the key must be a **bpf_fullsock()**
2644+
* and the map must be a BPF_MAP_TYPE_SK_STORAGE also.
2645+
*
2646+
* Underneath, the value is stored locally at *sk* instead of
2647+
* the map. The *map* is used as the bpf-local-storage **type**.
2648+
* The bpf-local-storage **type** (i.e. the *map*) is searched
2649+
* against all bpf-local-storages residing at sk.
2650+
*
2651+
* An optional *flags* (BPF_SK_STORAGE_GET_F_CREATE) can be
2652+
* used such that a new bpf-local-storage will be
2653+
* created if one does not exist. *value* can be used
2654+
* together with BPF_SK_STORAGE_GET_F_CREATE to specify
2655+
* the initial value of a bpf-local-storage. If *value* is
2656+
* NULL, the new bpf-local-storage will be zero initialized.
2657+
* Return
2658+
* A bpf-local-storage pointer is returned on success.
2659+
*
2660+
* **NULL** if not found or there was an error in adding
2661+
* a new bpf-local-storage.
2662+
*
2663+
* int bpf_sk_storage_delete(struct bpf_map *map, struct bpf_sock *sk)
2664+
* Description
2665+
* Delete a bpf-local-storage from a sk.
2666+
* Return
2667+
* 0 on success.
2668+
*
2669+
* **-ENOENT** if the bpf-local-storage cannot be found.
26332670
*/
26342671
#define __BPF_FUNC_MAPPER(FN) \
26352672
FN(unspec), \
@@ -2738,7 +2775,9 @@ union bpf_attr {
27382775
FN(sysctl_get_new_value), \
27392776
FN(sysctl_set_new_value), \
27402777
FN(strtol), \
2741-
FN(strtoul),
2778+
FN(strtoul), \
2779+
FN(sk_storage_get), \
2780+
FN(sk_storage_delete),
27422781

27432782
/* integer value in 'imm' field of BPF_CALL instruction selects which helper
27442783
* function eBPF program intends to call
@@ -2814,6 +2853,9 @@ enum bpf_func_id {
28142853
/* BPF_FUNC_sysctl_get_name flags. */
28152854
#define BPF_F_SYSCTL_BASE_NAME (1ULL << 0)
28162855

2856+
/* BPF_FUNC_sk_storage_get flags */
2857+
#define BPF_SK_STORAGE_GET_F_CREATE (1ULL << 0)
2858+
28172859
/* Mode for BPF_FUNC_skb_adjust_room helper. */
28182860
enum bpf_adj_room_mode {
28192861
BPF_ADJ_ROOM_NET,

Diff for: kernel/bpf/syscall.c

+2-1
Original file line numberDiff line numberDiff line change
@@ -526,7 +526,8 @@ static int map_check_btf(struct bpf_map *map, const struct btf *btf,
526526
return -EACCES;
527527
if (map->map_type != BPF_MAP_TYPE_HASH &&
528528
map->map_type != BPF_MAP_TYPE_ARRAY &&
529-
map->map_type != BPF_MAP_TYPE_CGROUP_STORAGE)
529+
map->map_type != BPF_MAP_TYPE_CGROUP_STORAGE &&
530+
map->map_type != BPF_MAP_TYPE_SK_STORAGE)
530531
return -ENOTSUPP;
531532
if (map->spin_lock_off + sizeof(struct bpf_spin_lock) >
532533
map->value_size) {

Diff for: kernel/bpf/verifier.c

+24-3
Original file line numberDiff line numberDiff line change
@@ -2543,10 +2543,15 @@ static int check_func_arg(struct bpf_verifier_env *env, u32 regno,
25432543

25442544
if (arg_type == ARG_PTR_TO_MAP_KEY ||
25452545
arg_type == ARG_PTR_TO_MAP_VALUE ||
2546-
arg_type == ARG_PTR_TO_UNINIT_MAP_VALUE) {
2546+
arg_type == ARG_PTR_TO_UNINIT_MAP_VALUE ||
2547+
arg_type == ARG_PTR_TO_MAP_VALUE_OR_NULL) {
25472548
expected_type = PTR_TO_STACK;
2548-
if (!type_is_pkt_pointer(type) && type != PTR_TO_MAP_VALUE &&
2549-
type != expected_type)
2549+
if (register_is_null(reg) &&
2550+
arg_type == ARG_PTR_TO_MAP_VALUE_OR_NULL)
2551+
/* final test in check_stack_boundary() */;
2552+
else if (!type_is_pkt_pointer(type) &&
2553+
type != PTR_TO_MAP_VALUE &&
2554+
type != expected_type)
25502555
goto err_type;
25512556
} else if (arg_type == ARG_CONST_SIZE ||
25522557
arg_type == ARG_CONST_SIZE_OR_ZERO) {
@@ -2578,6 +2583,10 @@ static int check_func_arg(struct bpf_verifier_env *env, u32 regno,
25782583
}
25792584
meta->ref_obj_id = reg->ref_obj_id;
25802585
}
2586+
} else if (arg_type == ARG_PTR_TO_SOCKET) {
2587+
expected_type = PTR_TO_SOCKET;
2588+
if (type != expected_type)
2589+
goto err_type;
25812590
} else if (arg_type == ARG_PTR_TO_SPIN_LOCK) {
25822591
if (meta->func_id == BPF_FUNC_spin_lock) {
25832592
if (process_spin_lock(env, regno, true))
@@ -2635,6 +2644,8 @@ static int check_func_arg(struct bpf_verifier_env *env, u32 regno,
26352644
meta->map_ptr->key_size, false,
26362645
NULL);
26372646
} else if (arg_type == ARG_PTR_TO_MAP_VALUE ||
2647+
(arg_type == ARG_PTR_TO_MAP_VALUE_OR_NULL &&
2648+
!register_is_null(reg)) ||
26382649
arg_type == ARG_PTR_TO_UNINIT_MAP_VALUE) {
26392650
/* bpf_map_xxx(..., map_ptr, ..., value) call:
26402651
* check [value, value + map->value_size) validity
@@ -2784,6 +2795,11 @@ static int check_map_func_compatibility(struct bpf_verifier_env *env,
27842795
func_id != BPF_FUNC_map_push_elem)
27852796
goto error;
27862797
break;
2798+
case BPF_MAP_TYPE_SK_STORAGE:
2799+
if (func_id != BPF_FUNC_sk_storage_get &&
2800+
func_id != BPF_FUNC_sk_storage_delete)
2801+
goto error;
2802+
break;
27872803
default:
27882804
break;
27892805
}
@@ -2847,6 +2863,11 @@ static int check_map_func_compatibility(struct bpf_verifier_env *env,
28472863
map->map_type != BPF_MAP_TYPE_STACK)
28482864
goto error;
28492865
break;
2866+
case BPF_FUNC_sk_storage_get:
2867+
case BPF_FUNC_sk_storage_delete:
2868+
if (map->map_type != BPF_MAP_TYPE_SK_STORAGE)
2869+
goto error;
2870+
break;
28502871
default:
28512872
break;
28522873
}

Diff for: net/bpf/test_run.c

+2
Original file line numberDiff line numberDiff line change
@@ -10,6 +10,7 @@
1010
#include <linux/etherdevice.h>
1111
#include <linux/filter.h>
1212
#include <linux/sched/signal.h>
13+
#include <net/bpf_sk_storage.h>
1314
#include <net/sock.h>
1415
#include <net/tcp.h>
1516

@@ -335,6 +336,7 @@ int bpf_prog_test_run_skb(struct bpf_prog *prog, const union bpf_attr *kattr,
335336
sizeof(struct __sk_buff));
336337
out:
337338
kfree_skb(skb);
339+
bpf_sk_storage_free(sk);
338340
kfree(sk);
339341
kfree(ctx);
340342
return ret;

Diff for: net/core/Makefile

+1
Original file line numberDiff line numberDiff line change
@@ -34,3 +34,4 @@ obj-$(CONFIG_HWBM) += hwbm.o
3434
obj-$(CONFIG_NET_DEVLINK) += devlink.o
3535
obj-$(CONFIG_GRO_CELLS) += gro_cells.o
3636
obj-$(CONFIG_FAILOVER) += failover.o
37+
obj-$(CONFIG_BPF_SYSCALL) += bpf_sk_storage.o

0 commit comments

Comments
 (0)