
Commit 4caeccb

Merge branch 'xen-netback-next'
Zoltan Kiss says:

====================
xen-netback: TX grant mapping with SKBTX_DEV_ZEROCOPY instead of copy

A long-known problem of the upstream netback implementation is that on the TX path (from guest to Dom0) it copies the whole packet from guest memory into Dom0. That simply became a bottleneck with 10Gb NICs, and generally it's a huge performance penalty. The classic kernel version of netback used grant mapping, and to get notified when the page can be unmapped, it used page destructors. Unfortunately that destructor is not an upstreamable solution. Ian Campbell's skb fragment destructor patch series [1] tried to solve this problem, but it seems to be very invasive on the network stack's code and therefore hasn't progressed very well.

This patch series uses the SKBTX_DEV_ZEROCOPY flag to tell the stack it needs to know when the skb is freed up. That is the way KVM solved the same problem, and based on my initial tests it can do the same for us. Avoiding the extra copy boosted TX throughput from 6.8 Gbps to 7.9 Gbps (I used a slower AMD Interlagos box, both Dom0 and guest on an upstream kernel, on the same NUMA node, running iperf 2.0.5, and the remote end was a bare-metal box on the same 10Gb switch).

Based on my investigations the packet only gets copied if it is delivered to the Dom0 IP stack through deliver_skb, which is due to this [2] patch. This affects DomU->Dom0 IP traffic and the case when Dom0 does routing/NAT for the guest. That's a bit unfortunate, but luckily it doesn't cause a major regression for this use case. In the future we should try to eliminate that copy somehow.

There are a few spinoff tasks which will be addressed in separate patches:
- grant copy the header directly instead of map and memcpy; this should help us avoid TLB flushing
- use something other than ballooned pages
- fix grant map to use page->index properly

I've tried to break it down into smaller patches, with mixed results, so I welcome suggestions on that part as well:
1: Use skb->cb to store pending_idx
2: Some refactoring
3: Change RX path for mapped SKB fragments (moved here to keep bisectability, review it after #4)
4: Introduce TX grant mapping
5: Remove old TX grant copy definitions and fix indentations
6: Add stat counters for zerocopy
7: Handle guests with too many frags
8: Timeout packets in RX path
9: Aggregate TX unmap operations

v2: I've fixed some smaller things, see the individual patches. I've added a few new stat counters, plus handling of the important use case when an older guest sends lots of slots. Instead of a delayed copy we now time out packets on the RX path, based on the assumption that otherwise packets shouldn't get stuck anywhere else. Finally, some unmap batching to avoid too much TLB flushing.

v3: Apart from fixing a few things mentioned in the responses, the important change is using the hypercall directly for grant [un]mapping, so we can avoid the m2p override.

v4: Now we are using a new grant mapping API to avoid m2p_override. The RX queue timeout logic changed as well.

v5: Only minor fixes based on Wei's comments.

v6: Important bugfixes for the xenvif_poll exit path and the zerocopy callback, see the first 2 patches. Also a rework of handling packets with too many slots, and a slight reordering of the series.

v7: Small fixes in comments/log messages/error paths, and merging the frag overflow stats patch into its parent.

[1] http://lwn.net/Articles/491522/
[2] https://lkml.org/lkml/2012/7/20/363
====================

Signed-off-by: Zoltan Kiss <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
2 parents 31c70d5 + e9275f5 commit 4caeccb
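
As a point of reference, here is a minimal sketch (not taken from this series) of the SKBTX_DEV_ZEROCOPY pattern the cover letter describes, against the zerocopy API of kernels of this era; the helpers my_zerocopy_callback() and attach_zerocopy() are hypothetical, while netback's real callback is xenvif_zerocopy_callback(), declared in common.h below:

#include <linux/skbuff.h>

/* Hypothetical driver-side sketch: ask the stack for a notification when
 * the skb's frag pages are finally released, the same mechanism vhost/KVM
 * uses.
 */
static void my_zerocopy_callback(struct ubuf_info *ubuf,
                                 bool zerocopy_success)
{
        /* The stack dropped its last reference to the mapped pages; it is
         * now safe to grant-unmap them and return them to the guest.
         */
}

static void attach_zerocopy(struct sk_buff *skb, struct ubuf_info *ubuf)
{
        ubuf->callback = my_zerocopy_callback;  /* run on final release */
        skb_shinfo(skb)->destructor_arg = ubuf; /* per-skb callback data */
        skb_shinfo(skb)->tx_flags |= SKBTX_DEV_ZEROCOPY;
}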

File tree

3 files changed (+740, -289 lines)


drivers/net/xen-netback/common.h

Lines changed: 76 additions & 37 deletions
@@ -48,37 +48,19 @@
 typedef unsigned int pending_ring_idx_t;
 #define INVALID_PENDING_RING_IDX (~0U)
 
-/* For the head field in pending_tx_info: it is used to indicate
- * whether this tx info is the head of one or more coalesced requests.
- *
- * When head != INVALID_PENDING_RING_IDX, it means the start of a new
- * tx requests queue and the end of previous queue.
- *
- * An example sequence of head fields (I = INVALID_PENDING_RING_IDX):
- *
- * ...|0 I I I|5 I|9 I I I|...
- * -->|<-INUSE----------------
- *
- * After consuming the first slot(s) we have:
- *
- * ...|V V V V|5 I|9 I I I|...
- * -----FREE->|<-INUSE--------
- *
- * where V stands for "valid pending ring index". Any number other
- * than INVALID_PENDING_RING_IDX is OK. These entries are considered
- * free and can contain any number other than
- * INVALID_PENDING_RING_IDX. In practice we use 0.
- *
- * The in use non-INVALID_PENDING_RING_IDX (say 0, 5 and 9 in the
- * above example) number is the index into pending_tx_info and
- * mmap_pages arrays.
- */
 struct pending_tx_info {
-        struct xen_netif_tx_request req; /* coalesced tx request */
-        pending_ring_idx_t head; /* head != INVALID_PENDING_RING_IDX
-                                  * if it is head of one or more tx
-                                  * reqs
-                                  */
+        struct xen_netif_tx_request req; /* tx request */
+        /* Callback data for released SKBs. The callback is always
+         * xenvif_zerocopy_callback, desc contains the pending_idx, which is
+         * also an index in pending_tx_info array. It is initialized in
+         * xenvif_alloc and it never changes.
+         * skb_shinfo(skb)->destructor_arg points to the first mapped slot's
+         * callback_struct in this array of struct pending_tx_info's, then ctx
+         * to the next, or NULL if there is no more slot for this skb.
+         * ubuf_to_vif is a helper which finds the struct xenvif from a pointer
+         * to this field.
+         */
+        struct ubuf_info callback_struct;
 };
 
 #define XEN_NETIF_TX_RING_SIZE __CONST_RING_SIZE(xen_netif_tx, PAGE_SIZE)
@@ -108,6 +90,15 @@ struct xenvif_rx_meta {
  */
 #define MAX_GRANT_COPY_OPS (MAX_SKB_FRAGS * XEN_NETIF_RX_RING_SIZE)
 
+#define NETBACK_INVALID_HANDLE -1
+
+/* To avoid confusion, we define XEN_NETBK_LEGACY_SLOTS_MAX indicating
+ * the maximum slots a valid packet can use. Now this value is defined
+ * to be XEN_NETIF_NR_SLOTS_MIN, which is supposed to be supported by
+ * all backend.
+ */
+#define XEN_NETBK_LEGACY_SLOTS_MAX XEN_NETIF_NR_SLOTS_MIN
+
 struct xenvif {
         /* Unique identifier for this interface. */
         domid_t domid;
@@ -126,13 +117,28 @@ struct xenvif {
         pending_ring_idx_t pending_cons;
         u16 pending_ring[MAX_PENDING_REQS];
         struct pending_tx_info pending_tx_info[MAX_PENDING_REQS];
-
-        /* Coalescing tx requests before copying makes number of grant
-         * copy ops greater or equal to number of slots required. In
-         * worst case a tx request consumes 2 gnttab_copy.
+        grant_handle_t grant_tx_handle[MAX_PENDING_REQS];
+
+        struct gnttab_map_grant_ref tx_map_ops[MAX_PENDING_REQS];
+        struct gnttab_unmap_grant_ref tx_unmap_ops[MAX_PENDING_REQS];
+        /* passed to gnttab_[un]map_refs with pages under (un)mapping */
+        struct page *pages_to_map[MAX_PENDING_REQS];
+        struct page *pages_to_unmap[MAX_PENDING_REQS];
+
+        /* This prevents zerocopy callbacks to race over dealloc_ring */
+        spinlock_t callback_lock;
+        /* This prevents dealloc thread and NAPI instance to race over response
+         * creation and pending_ring in xenvif_idx_release. In xenvif_tx_err
+         * it only protect response creation
          */
-        struct gnttab_copy tx_copy_ops[2*MAX_PENDING_REQS];
-
+        spinlock_t response_lock;
+        pending_ring_idx_t dealloc_prod;
+        pending_ring_idx_t dealloc_cons;
+        u16 dealloc_ring[MAX_PENDING_REQS];
+        struct task_struct *dealloc_task;
+        wait_queue_head_t dealloc_wq;
+        struct timer_list dealloc_delay;
+        bool dealloc_delay_timed_out;
 
         /* Use kthread for guest RX */
         struct task_struct *task;
@@ -144,6 +150,9 @@ struct xenvif {
         struct xen_netif_rx_back_ring rx;
         struct sk_buff_head rx_queue;
         RING_IDX rx_last_skb_slots;
+        bool rx_queue_purge;
+
+        struct timer_list wake_queue;
 
         /* This array is allocated seperately as it is large */
         struct gnttab_copy *grant_copy_op;
@@ -175,6 +184,10 @@ struct xenvif {
 
         /* Statistics */
         unsigned long rx_gso_checksum_fixup;
+        unsigned long tx_zerocopy_sent;
+        unsigned long tx_zerocopy_success;
+        unsigned long tx_zerocopy_fail;
+        unsigned long tx_frag_overflow;
 
         /* Miscellaneous private stuff. */
         struct net_device *dev;
@@ -216,16 +229,42 @@ void xenvif_carrier_off(struct xenvif *vif);
 
 int xenvif_tx_action(struct xenvif *vif, int budget);
 
-int xenvif_kthread(void *data);
+int xenvif_kthread_guest_rx(void *data);
 void xenvif_kick_thread(struct xenvif *vif);
 
+int xenvif_dealloc_kthread(void *data);
+
 /* Determine whether the needed number of slots (req) are available,
  * and set req_event if not.
  */
 bool xenvif_rx_ring_slots_available(struct xenvif *vif, int needed);
 
 void xenvif_stop_queue(struct xenvif *vif);
 
+/* Callback from stack when TX packet can be released */
+void xenvif_zerocopy_callback(struct ubuf_info *ubuf, bool zerocopy_success);
+
+/* Unmap a pending page and release it back to the guest */
+void xenvif_idx_unmap(struct xenvif *vif, u16 pending_idx);
+
+static inline pending_ring_idx_t nr_pending_reqs(struct xenvif *vif)
+{
+        return MAX_PENDING_REQS -
+                vif->pending_prod + vif->pending_cons;
+}
+
+static inline bool xenvif_tx_pending_slots_available(struct xenvif *vif)
+{
+        return nr_pending_reqs(vif) + XEN_NETBK_LEGACY_SLOTS_MAX
+                < MAX_PENDING_REQS;
+}
+
+/* Callback from stack when TX packet can be released */
+void xenvif_zerocopy_callback(struct ubuf_info *ubuf, bool zerocopy_success);
+
 extern bool separate_tx_rx_irq;
 
+extern unsigned int rx_drain_timeout_msecs;
+extern unsigned int rx_drain_timeout_jiffies;
+
 #endif /* __XEN_NETBACK__COMMON_H__ */
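
To make the callback-chaining comment in pending_tx_info above concrete, here is a hypothetical walk over the per-skb slot chain; for_each_skb_slot() is illustrative and not part of the patch, while xenvif_idx_unmap() is the helper declared in this header:

/* destructor_arg points at the first mapped slot's callback_struct, each
 * ctx points at the next one, and NULL terminates the chain. desc holds
 * the pending_idx, i.e. the index into pending_tx_info.
 */
static void for_each_skb_slot(struct xenvif *vif, struct sk_buff *skb)
{
        struct ubuf_info *ubuf = skb_shinfo(skb)->destructor_arg;

        while (ubuf) {
                u16 pending_idx = ubuf->desc;

                xenvif_idx_unmap(vif, pending_idx);   /* give page back   */
                ubuf = (struct ubuf_info *)ubuf->ctx; /* next slot or NULL */
        }
}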
