
Commit 1f211a1

borkmann authored and davem330 committed
net, sched: add clsact qdisc
This work adds a generalization of the ingress qdisc as a qdisc holding only classifiers. The clsact qdisc works on ingress, but also on egress. In both cases, its execution happens without taking the qdisc lock, and the main difference of the egress part compared to the prior version of [1] is that this can be applied with _any_ underlying real egress qdisc (also classless ones).

Besides solving the use case of [1], that is, allowing for more programmability in assigning skb->priority for the mqprio case supported by most popular 10G+ NICs, it also opens up a lot more flexibility for other tc applications. The main work on classification can already be done at clsact egress time if the use case allows, and state stored for later retrieval, e.g. again in skb->priority with major/minors (which most classful qdiscs check before consulting tc_classify()) and/or in other skb fields like skb->tc_index for some lightweight post-processing to get to the eventual classid in case of a classful qdisc.

Another use case is that the clsact egress part allows a central egress counterpart to the ingress classifiers, so that classifiers can easily share state (e.g. in cls_bpf via eBPF maps) between ingress and egress. Currently, default setups like mq + pfifo_fast would require, for example, switching to the prio qdisc instead (to get a tc_classify() run) and duplicating the egress classifier for each queue. With clsact, the setup can be left as is: it can additionally assign skb->priority to put the skb into one of pfifo_fast's bands, and it can share state with maps. Moreover, we can access the skb's dst entry (e.g. to retrieve tclassid) without the need to perform a skb_dst_force() to hold on to it any longer. In the lwt case, we can also use this facility to set up dst metadata via cls_bpf (bpf_skb_set_tunnel_key()) without needing a real egress qdisc just for that (case of IFF_NO_QUEUE devices, for example).
The realization can be done without any changes to the scheduler core framework. All it takes is two a priori defined minors/child classes, which mux between the ingress and egress classifier lists (dev->ingress_cl_list and dev->egress_cl_list, the latter stored close to dev->_tx to avoid an extra cacheline miss for moderate loads). The egress part is modelled similarly to handle_ing() and patched to a noop in case the functionality is not used. Both handlers are now called sch_handle_ingress() and sch_handle_egress(); code sharing between the two doesn't seem practical, as there are various minor differences in both paths, so making them conditional in a single handler would rather slow things down.

Full compatibility with the ingress qdisc is provided as well. Since both piggyback on TC_H_CLSACT, only one of them (ingress/clsact) can exist per netdevice, and thus ingress qdisc specific behaviour can be retained for user space. This means a user either does 'tc qdisc add dev foo ingress' and configures the ingress qdisc as usual, or uses the 'tc qdisc add dev foo clsact' alternative, where both ingress and egress classifiers can be configured as in the example below. The ingress qdisc supports attaching classifiers to any minor number, whereas clsact has two fixed minors for muxing between the lists, so to not break user space setups they are better kept as two separate qdiscs.

I decided to extend the sch_ingress module with the clsact functionality so that commonly used code can be reused; the module is aliased as sch_clsact so that it can be auto-loaded properly. The alternative would have been to add a flag when initializing ingress to alter its behaviour, plus aliasing it to a different name (as it's more than just ingress). However, the former would end up choosing the new/old behaviour based on the flag by calling different function implementations anyway, and the latter would require registering the ingress qdisc once again under a different alias.
So this really begs for a minimal, cleaner approach of having its own Qdisc_ops and Qdisc_class_ops that share the callbacks used by both.

Example, adding qdisc:

  # tc qdisc add dev foo clsact
  # tc qdisc show dev foo
  qdisc mq 0: root
  qdisc pfifo_fast 0: parent :1 bands 3 priomap 1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
  qdisc pfifo_fast 0: parent :2 bands 3 priomap 1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
  qdisc pfifo_fast 0: parent :3 bands 3 priomap 1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
  qdisc pfifo_fast 0: parent :4 bands 3 priomap 1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
  qdisc clsact ffff: parent ffff:fff1

Adding filters (deleting, etc. works analogously by specifying ingress/egress):

  # tc filter add dev foo ingress bpf da obj bar.o sec ingress
  # tc filter add dev foo egress bpf da obj bar.o sec egress
  # tc filter show dev foo ingress
  filter protocol all pref 49152 bpf
  filter protocol all pref 49152 bpf handle 0x1 bar.o:[ingress] direct-action
  # tc filter show dev foo egress
  filter protocol all pref 49152 bpf
  filter protocol all pref 49152 bpf handle 0x1 bar.o:[egress] direct-action

A 'tc filter show dev foo' or 'tc filter show dev foo parent ffff:' will show an empty list for clsact. Either using the parent names (ingress/egress) or specifying the full major/minor will then show the related filter lists.

Prior work on an mqprio prequeue() facility [1] was done mainly by John Fastabend.

  [1] http://patchwork.ozlabs.org/patch/512949/

Signed-off-by: Daniel Borkmann <[email protected]>
Acked-by: John Fastabend <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
1 parent ede5599 commit 1f211a1

File tree

8 files changed: +186, -16 lines


Diff for: include/linux/netdevice.h (+3, -1)

@@ -1739,7 +1739,9 @@ struct net_device {
 #ifdef CONFIG_XPS
 	struct xps_dev_maps __rcu *xps_maps;
 #endif
-
+#ifdef CONFIG_NET_CLS_ACT
+	struct tcf_proto __rcu	*egress_cl_list;
+#endif
 #ifdef CONFIG_NET_SWITCHDEV
 	u32			offload_fwd_mark;
 #endif

Diff for: include/linux/rtnetlink.h (+5)

@@ -84,6 +84,11 @@ void net_inc_ingress_queue(void);
 void net_dec_ingress_queue(void);
 #endif
 
+#ifdef CONFIG_NET_EGRESS
+void net_inc_egress_queue(void);
+void net_dec_egress_queue(void);
+#endif
+
 extern void rtnetlink_init(void);
 extern void __rtnl_unlock(void);

Diff for: include/uapi/linux/pkt_sched.h (+4)

@@ -72,6 +72,10 @@ struct tc_estimator {
 #define TC_H_UNSPEC	(0U)
 #define TC_H_ROOT	(0xFFFFFFFFU)
 #define TC_H_INGRESS	(0xFFFFFFF1U)
+#define TC_H_CLSACT	TC_H_INGRESS
+
+#define TC_H_MIN_INGRESS	0xFFF2U
+#define TC_H_MIN_EGRESS		0xFFF3U
 
 /* Need to corrospond to iproute2 tc/tc_core.h "enum link_layer" */
 enum tc_link_layer {

Diff for: net/Kconfig (+3)

@@ -48,6 +48,9 @@ config COMPAT_NETLINK_MESSAGES
 config NET_INGRESS
 	bool
 
+config NET_EGRESS
+	bool
+
 menu "Networking options"
 
 source "net/packet/Kconfig"

Diff for: net/core/dev.c (+74, -8)

@@ -1676,6 +1676,22 @@ void net_dec_ingress_queue(void)
 EXPORT_SYMBOL_GPL(net_dec_ingress_queue);
 #endif
 
+#ifdef CONFIG_NET_EGRESS
+static struct static_key egress_needed __read_mostly;
+
+void net_inc_egress_queue(void)
+{
+	static_key_slow_inc(&egress_needed);
+}
+EXPORT_SYMBOL_GPL(net_inc_egress_queue);
+
+void net_dec_egress_queue(void)
+{
+	static_key_slow_dec(&egress_needed);
+}
+EXPORT_SYMBOL_GPL(net_dec_egress_queue);
+#endif
+
 static struct static_key netstamp_needed __read_mostly;
 #ifdef HAVE_JUMP_LABEL
 /* We are not allowed to call static_key_slow_dec() from irq context

@@ -3007,7 +3023,6 @@ static inline int __dev_xmit_skb(struct sk_buff *skb, struct Qdisc *q,
 	bool contended;
 	int rc;
 
-	qdisc_pkt_len_init(skb);
 	qdisc_calculate_pkt_len(skb, q);
 	/*
 	 * Heuristic to force contended enqueues to serialize on a

@@ -3100,6 +3115,49 @@ int dev_loopback_xmit(struct net *net, struct sock *sk, struct sk_buff *skb)
 }
 EXPORT_SYMBOL(dev_loopback_xmit);
 
+#ifdef CONFIG_NET_EGRESS
+static struct sk_buff *
+sch_handle_egress(struct sk_buff *skb, int *ret, struct net_device *dev)
+{
+	struct tcf_proto *cl = rcu_dereference_bh(dev->egress_cl_list);
+	struct tcf_result cl_res;
+
+	if (!cl)
+		return skb;
+
+	/* skb->tc_verd and qdisc_skb_cb(skb)->pkt_len were already set
+	 * earlier by the caller.
+	 */
+	qdisc_bstats_cpu_update(cl->q, skb);
+
+	switch (tc_classify(skb, cl, &cl_res, false)) {
+	case TC_ACT_OK:
+	case TC_ACT_RECLASSIFY:
+		skb->tc_index = TC_H_MIN(cl_res.classid);
+		break;
+	case TC_ACT_SHOT:
+		qdisc_qstats_cpu_drop(cl->q);
+		*ret = NET_XMIT_DROP;
+		goto drop;
+	case TC_ACT_STOLEN:
+	case TC_ACT_QUEUED:
+		*ret = NET_XMIT_SUCCESS;
+drop:
+		kfree_skb(skb);
+		return NULL;
+	case TC_ACT_REDIRECT:
+		/* No need to push/pop skb's mac_header here on egress! */
+		skb_do_redirect(skb);
+		*ret = NET_XMIT_SUCCESS;
+		return NULL;
+	default:
+		break;
+	}
+
+	return skb;
+}
+#endif /* CONFIG_NET_EGRESS */
+
 static inline int get_xps_queue(struct net_device *dev, struct sk_buff *skb)
 {
 #ifdef CONFIG_XPS

@@ -3226,6 +3284,17 @@ static int __dev_queue_xmit(struct sk_buff *skb, void *accel_priv)
 
 	skb_update_prio(skb);
 
+	qdisc_pkt_len_init(skb);
+#ifdef CONFIG_NET_CLS_ACT
+	skb->tc_verd = SET_TC_AT(skb->tc_verd, AT_EGRESS);
+# ifdef CONFIG_NET_EGRESS
+	if (static_key_false(&egress_needed)) {
+		skb = sch_handle_egress(skb, &rc, dev);
+		if (!skb)
+			goto out;
+	}
+# endif
+#endif
 	/* If device/qdisc don't need skb->dst, release it right now while
 	 * its hot in this cpu cache.
 	 */

@@ -3247,9 +3316,6 @@ static int __dev_queue_xmit(struct sk_buff *skb, void *accel_priv)
 	txq = netdev_pick_tx(dev, skb, accel_priv);
 	q = rcu_dereference_bh(txq->qdisc);
 
-#ifdef CONFIG_NET_CLS_ACT
-	skb->tc_verd = SET_TC_AT(skb->tc_verd, AT_EGRESS);
-#endif
 	trace_net_dev_queue(skb);
 	if (q->enqueue) {
 		rc = __dev_xmit_skb(skb, q, dev, txq);

@@ -3806,9 +3872,9 @@ int (*br_fdb_test_addr_hook)(struct net_device *dev,
 EXPORT_SYMBOL_GPL(br_fdb_test_addr_hook);
 #endif
 
-static inline struct sk_buff *handle_ing(struct sk_buff *skb,
-					 struct packet_type **pt_prev,
-					 int *ret, struct net_device *orig_dev)
+static inline struct sk_buff *
+sch_handle_ingress(struct sk_buff *skb, struct packet_type **pt_prev, int *ret,
+		   struct net_device *orig_dev)
 {
 #ifdef CONFIG_NET_CLS_ACT
 	struct tcf_proto *cl = rcu_dereference_bh(skb->dev->ingress_cl_list);

@@ -4002,7 +4068,7 @@ static int __netif_receive_skb_core(struct sk_buff *skb, bool pfmemalloc)
 skip_taps:
 #ifdef CONFIG_NET_INGRESS
 	if (static_key_false(&ingress_needed)) {
-		skb = handle_ing(skb, &pt_prev, &ret, orig_dev);
+		skb = sch_handle_ingress(skb, &pt_prev, &ret, orig_dev);
 		if (!skb)
 			goto out;
 

Diff for: net/sched/Kconfig (+10, -4)

@@ -310,15 +310,21 @@ config NET_SCH_PIE
 	  If unsure, say N.
 
 config NET_SCH_INGRESS
-	tristate "Ingress Qdisc"
+	tristate "Ingress/classifier-action Qdisc"
 	depends on NET_CLS_ACT
 	select NET_INGRESS
+	select NET_EGRESS
 	---help---
-	  Say Y here if you want to use classifiers for incoming packets.
+	  Say Y here if you want to use classifiers for incoming and/or outgoing
+	  packets. This qdisc doesn't do anything else besides running classifiers,
+	  which can also have actions attached to them. In case of outgoing packets,
+	  classifiers that this qdisc holds are executed in the transmit path
+	  before real enqueuing to an egress qdisc happens.
+
 	  If unsure, say Y.
 
-	  To compile this code as a module, choose M here: the
-	  module will be called sch_ingress.
+	  To compile this code as a module, choose M here: the module will be
+	  called sch_ingress with alias of sch_clsact.
 
 config NET_SCH_PLUG
 	tristate "Plug network traffic until release (PLUG)"

Diff for: net/sched/cls_bpf.c (+1, -1)

@@ -291,7 +291,7 @@ static int cls_bpf_prog_from_efd(struct nlattr **tb, struct cls_bpf_prog *prog,
 	prog->bpf_name = name;
 	prog->filter = fp;
 
-	if (fp->dst_needed)
+	if (fp->dst_needed && !(tp->q->flags & TCQ_F_INGRESS))
 		netif_keep_dst(qdisc_dev(tp->q));
 
 	return 0;

Diff for: net/sched/sch_ingress.c (+86, -2)

@@ -1,4 +1,5 @@
-/* net/sched/sch_ingress.c - Ingress qdisc
+/* net/sched/sch_ingress.c - Ingress and clsact qdisc
+ *
  * This program is free software; you can redistribute it and/or
  * modify it under the terms of the GNU General Public License
  * as published by the Free Software Foundation; either version

@@ -98,17 +99,100 @@ static struct Qdisc_ops ingress_qdisc_ops __read_mostly = {
 	.owner		=	THIS_MODULE,
 };
 
+static unsigned long clsact_get(struct Qdisc *sch, u32 classid)
+{
+	switch (TC_H_MIN(classid)) {
+	case TC_H_MIN(TC_H_MIN_INGRESS):
+	case TC_H_MIN(TC_H_MIN_EGRESS):
+		return TC_H_MIN(classid);
+	default:
+		return 0;
+	}
+}
+
+static unsigned long clsact_bind_filter(struct Qdisc *sch,
+					unsigned long parent, u32 classid)
+{
+	return clsact_get(sch, classid);
+}
+
+static struct tcf_proto __rcu **clsact_find_tcf(struct Qdisc *sch,
+						unsigned long cl)
+{
+	struct net_device *dev = qdisc_dev(sch);
+
+	switch (cl) {
+	case TC_H_MIN(TC_H_MIN_INGRESS):
+		return &dev->ingress_cl_list;
+	case TC_H_MIN(TC_H_MIN_EGRESS):
+		return &dev->egress_cl_list;
+	default:
+		return NULL;
+	}
+}
+
+static int clsact_init(struct Qdisc *sch, struct nlattr *opt)
+{
+	net_inc_ingress_queue();
+	net_inc_egress_queue();
+
+	sch->flags |= TCQ_F_CPUSTATS;
+
+	return 0;
+}
+
+static void clsact_destroy(struct Qdisc *sch)
+{
+	struct net_device *dev = qdisc_dev(sch);
+
+	tcf_destroy_chain(&dev->ingress_cl_list);
+	tcf_destroy_chain(&dev->egress_cl_list);
+
+	net_dec_ingress_queue();
+	net_dec_egress_queue();
+}
+
+static const struct Qdisc_class_ops clsact_class_ops = {
+	.leaf		=	ingress_leaf,
+	.get		=	clsact_get,
+	.put		=	ingress_put,
+	.walk		=	ingress_walk,
+	.tcf_chain	=	clsact_find_tcf,
+	.bind_tcf	=	clsact_bind_filter,
+	.unbind_tcf	=	ingress_put,
+};
+
+static struct Qdisc_ops clsact_qdisc_ops __read_mostly = {
+	.cl_ops		=	&clsact_class_ops,
+	.id		=	"clsact",
+	.init		=	clsact_init,
+	.destroy	=	clsact_destroy,
+	.dump		=	ingress_dump,
+	.owner		=	THIS_MODULE,
+};
+
 static int __init ingress_module_init(void)
 {
-	return register_qdisc(&ingress_qdisc_ops);
+	int ret;
+
+	ret = register_qdisc(&ingress_qdisc_ops);
+	if (!ret) {
+		ret = register_qdisc(&clsact_qdisc_ops);
+		if (ret)
+			unregister_qdisc(&ingress_qdisc_ops);
+	}
+
+	return ret;
 }
 
 static void __exit ingress_module_exit(void)
 {
 	unregister_qdisc(&ingress_qdisc_ops);
+	unregister_qdisc(&clsact_qdisc_ops);
 }
 
 module_init(ingress_module_init);
 module_exit(ingress_module_exit);
 
+MODULE_ALIAS("sch_clsact");
 MODULE_LICENSE("GPL");
