
Commit d10c809

htejun authored and axboe committed
writeback: implement foreign cgroup inode bdi_writeback switching
As concurrent write sharing of an inode is expected to be very rare and memcg only tracks page ownership on first-use basis severely confining the usefulness of such sharing, cgroup writeback tracks ownership per-inode. While the support for concurrent write sharing of an inode is deemed unnecessary, an inode being written to by different cgroups at different points in time is a lot more common, and, more importantly, charging only by first-use can too readily lead to grossly incorrect behaviors (single foreign page can lead to gigabytes of writeback to be incorrectly attributed).

To resolve this issue, cgroup writeback detects the majority dirtier of an inode and transfers the ownership to it. The previous patches implemented the foreign condition detection mechanism and laid the groundwork. This patch implements the actual switching.

With the previously implemented [unlocked_]inode_to_wb_and_list_lock() and wb stat transaction, grabbing wb->list_lock, inode->i_lock and mapping->tree_lock gives us full exclusion against all wb operations on the target inode. inode_switch_wbs_work_fn() grabs all the locks and transfers the inode atomically along with its RECLAIMABLE and WRITEBACK stats.

Signed-off-by: Tejun Heo <[email protected]>
Cc: Jens Axboe <[email protected]>
Cc: Jan Kara <[email protected]>
Cc: Wu Fengguang <[email protected]>
Cc: Greg Thelen <[email protected]>
Signed-off-by: Jens Axboe <[email protected]>
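The locking scheme described above relies on a standard trick: when two locks of the same class (the two wb->list_locks) must be held together, every path takes them in a fixed order, here chosen by comparing the lock addresses, so an AB/BA deadlock cannot occur. Below is a minimal user-space sketch of the same pattern, using pthread mutexes rather than kernel spinlocks, with an invented helper name and assuming the two locks are always distinct (as old_wb and new_wb are during a switch):

#include <pthread.h>

/*
 * Illustrative sketch, not kernel code: take two locks of the same class
 * in a fixed order chosen by comparing their addresses, the same trick the
 * patch uses for old_wb->list_lock vs. new_wb->list_lock.  Assumes a != b.
 */
static void lock_pair_ordered(pthread_mutex_t *a, pthread_mutex_t *b)
{
	if (a < b) {
		pthread_mutex_lock(a);
		pthread_mutex_lock(b);
	} else {
		pthread_mutex_lock(b);
		pthread_mutex_lock(a);
	}
}

static void unlock_pair(pthread_mutex_t *a, pthread_mutex_t *b)
{
	pthread_mutex_unlock(b);
	pthread_mutex_unlock(a);
}

In the kernel version the inner lock is additionally taken with spin_lock_nested(..., SINGLE_DEPTH_NESTING) so that lockdep accepts holding two locks of the same class at once.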
1 parent aaa2cac commit d10c809

1 file changed: +84, -2 lines changed


fs/fs-writeback.c

Lines changed: 84 additions & 2 deletions
@@ -322,30 +322,112 @@ static void inode_switch_wbs_work_fn(struct work_struct *work)
 	struct inode_switch_wbs_context *isw =
 		container_of(work, struct inode_switch_wbs_context, work);
 	struct inode *inode = isw->inode;
+	struct address_space *mapping = inode->i_mapping;
+	struct bdi_writeback *old_wb = inode->i_wb;
 	struct bdi_writeback *new_wb = isw->new_wb;
+	struct radix_tree_iter iter;
+	bool switched = false;
+	void **slot;
 
 	/*
 	 * By the time control reaches here, RCU grace period has passed
 	 * since I_WB_SWITCH assertion and all wb stat update transactions
 	 * between unlocked_inode_to_wb_begin/end() are guaranteed to be
 	 * synchronizing against mapping->tree_lock.
+	 *
+	 * Grabbing old_wb->list_lock, inode->i_lock and mapping->tree_lock
+	 * gives us exclusion against all wb related operations on @inode
+	 * including IO list manipulations and stat updates.
 	 */
+	if (old_wb < new_wb) {
+		spin_lock(&old_wb->list_lock);
+		spin_lock_nested(&new_wb->list_lock, SINGLE_DEPTH_NESTING);
+	} else {
+		spin_lock(&new_wb->list_lock);
+		spin_lock_nested(&old_wb->list_lock, SINGLE_DEPTH_NESTING);
+	}
 	spin_lock(&inode->i_lock);
+	spin_lock_irq(&mapping->tree_lock);
+
+	/*
+	 * Once I_FREEING is visible under i_lock, the eviction path owns
+	 * the inode and we shouldn't modify ->i_wb_list.
+	 */
+	if (unlikely(inode->i_state & I_FREEING))
+		goto skip_switch;
 
+	/*
+	 * Count and transfer stats.  Note that PAGECACHE_TAG_DIRTY points
+	 * to possibly dirty pages while PAGECACHE_TAG_WRITEBACK points to
+	 * pages actually under underwriteback.
+	 */
+	radix_tree_for_each_tagged(slot, &mapping->page_tree, &iter, 0,
+				   PAGECACHE_TAG_DIRTY) {
+		struct page *page = radix_tree_deref_slot_protected(slot,
+							&mapping->tree_lock);
+		if (likely(page) && PageDirty(page)) {
+			__dec_wb_stat(old_wb, WB_RECLAIMABLE);
+			__inc_wb_stat(new_wb, WB_RECLAIMABLE);
+		}
+	}
+
+	radix_tree_for_each_tagged(slot, &mapping->page_tree, &iter, 0,
+				   PAGECACHE_TAG_WRITEBACK) {
+		struct page *page = radix_tree_deref_slot_protected(slot,
+							&mapping->tree_lock);
+		if (likely(page)) {
+			WARN_ON_ONCE(!PageWriteback(page));
+			__dec_wb_stat(old_wb, WB_WRITEBACK);
+			__inc_wb_stat(new_wb, WB_WRITEBACK);
+		}
+	}
+
+	wb_get(new_wb);
+
+	/*
+	 * Transfer to @new_wb's IO list if necessary.  The specific list
+	 * @inode was on is ignored and the inode is put on ->b_dirty which
+	 * is always correct including from ->b_dirty_time.  The transfer
+	 * preserves @inode->dirtied_when ordering.
+	 */
+	if (!list_empty(&inode->i_wb_list)) {
+		struct inode *pos;
+
+		inode_wb_list_del_locked(inode, old_wb);
+		inode->i_wb = new_wb;
+		list_for_each_entry(pos, &new_wb->b_dirty, i_wb_list)
+			if (time_after_eq(inode->dirtied_when,
+					  pos->dirtied_when))
+				break;
+		inode_wb_list_move_locked(inode, new_wb, pos->i_wb_list.prev);
+	} else {
+		inode->i_wb = new_wb;
+	}
+
+	/* ->i_wb_frn updates may race wbc_detach_inode() but doesn't matter */
 	inode->i_wb_frn_winner = 0;
 	inode->i_wb_frn_avg_time = 0;
 	inode->i_wb_frn_history = 0;
-
+	switched = true;
+skip_switch:
 	/*
 	 * Paired with load_acquire in unlocked_inode_to_wb_begin() and
 	 * ensures that the new wb is visible if they see !I_WB_SWITCH.
 	 */
 	smp_store_release(&inode->i_state, inode->i_state & ~I_WB_SWITCH);
 
+	spin_unlock_irq(&mapping->tree_lock);
 	spin_unlock(&inode->i_lock);
+	spin_unlock(&new_wb->list_lock);
+	spin_unlock(&old_wb->list_lock);
 
-	iput(inode);
+	if (switched) {
+		wb_wakeup(new_wb);
+		wb_put(old_wb);
+	}
 	wb_put(new_wb);
+
+	iput(inode);
 	kfree(isw);
 }
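One detail worth calling out from the hunk above: the smp_store_release() that clears I_WB_SWITCH pairs with a load-acquire on the reader side (in unlocked_inode_to_wb_begin()), so any path that observes the flag cleared is also guaranteed to observe the new ->i_wb pointer and the transferred stats. Below is a rough user-space analogue in C11 atomics, a sketch only, with invented names and assuming a single writer (the switch work item):

#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Illustrative user-space analogue, not the kernel API.  Names
 * (fake_inode, wb, switching) are invented for the example.  One writer
 * sets the flag, updates the payload, then clears the flag with a release
 * store; readers that see the flag clear via an acquire load are
 * guaranteed to also see the new payload.
 */
struct fake_inode {
	void *wb;			/* payload published by the writer */
	atomic_bool switching;		/* stands in for I_WB_SWITCH */
};

static void begin_switch(struct fake_inode *inode)
{
	/* the kernel also waits an RCU grace period after setting the flag */
	atomic_store_explicit(&inode->switching, true, memory_order_seq_cst);
}

static void finish_switch(struct fake_inode *inode, void *new_wb)
{
	inode->wb = new_wb;				/* plain write ... */
	atomic_store_explicit(&inode->switching, false,	/* ... published here */
			      memory_order_release);
}

static void *wb_if_stable(struct fake_inode *inode)
{
	/* pairs with the release store in finish_switch() */
	if (!atomic_load_explicit(&inode->switching, memory_order_acquire))
		return inode->wb;	/* sees the value written before the release */
	return NULL;			/* switch in flight: fall back to locking */
}

The point is ordering rather than atomicity of the pointer itself: the release store publishes every write that precedes it in the writer, and the acquire load on the reader side makes those writes visible once the flag is seen clear.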
