Skip to content

Commit f3a5367

Browse files
adam900710kdave
authored andcommitted
btrfs: protect folio::private when attaching extent buffer folios
[BUG] Since v6.8 there are rare kernel crashes reported by various people, the common factor is bad page status error messages like this: BUG: Bad page state in process kswapd0 pfn:d6e840 page: refcount:0 mapcount:0 mapping:000000007512f4f2 index:0x2796c2c7c pfn:0xd6e840 aops:btree_aops ino:1 flags: 0x17ffffe0000008(uptodate|node=0|zone=2|lastcpupid=0x3fffff) page_type: 0xffffffff() raw: 0017ffffe0000008 dead000000000100 dead000000000122 ffff88826d0be4c0 raw: 00000002796c2c7c 0000000000000000 00000000ffffffff 0000000000000000 page dumped because: non-NULL mapping [CAUSE] Commit 09e6cef ("btrfs: refactor alloc_extent_buffer() to allocate-then-attach method") changes the sequence when allocating a new extent buffer. Previously we always called grab_extent_buffer() under mapping->i_private_lock, to ensure the safety on modification on folio::private (which is a pointer to extent buffer for regular sectorsize). This can lead to the following race: Thread A is trying to allocate an extent buffer at bytenr X, with 4 4K pages, meanwhile thread B is trying to release the page at X + 4K (the second page of the extent buffer at X). Thread A | Thread B -----------------------------------+------------------------------------- | btree_release_folio() | | This is for the page at X + 4K, | | Not page X. | | alloc_extent_buffer() | |- release_extent_buffer() |- filemap_add_folio() for the | | |- atomic_dec_and_test(eb->refs) | page at bytenr X (the first | | | | page). | | | | Which returned -EEXIST. | | | | | | | |- filemap_lock_folio() | | | | Returned the first page locked. | | | | | | | |- grab_extent_buffer() | | | | |- atomic_inc_not_zero() | | | | | Returned false | | | | |- folio_detach_private() | | |- folio_detach_private() for X | |- folio_test_private() | | |- folio_test_private() | Returned true | | | Returned true |- folio_put() | |- folio_put() Now there are two puts on the same folio at folio X, leading to refcount underflow of the folio X, and eventually causing the BUG_ON() on the page->mapping. The condition is not that easy to hit: - The release must be triggered for the middle page of an eb If the release is on the same first page of an eb, page lock would kick in and prevent the race. - folio_detach_private() has a very small race window It's only between folio_test_private() and folio_clear_private(). That's exactly when mapping->i_private_lock is used to prevent such race, and commit 09e6cef ("btrfs: refactor alloc_extent_buffer() to allocate-then-attach method") screwed that up. At that time, I thought the page lock would kick in as filemap_release_folio() also requires the page to be locked, but forgot the filemap_release_folio() only locks one page, not all pages of an extent buffer. [FIX] Move all the code requiring i_private_lock into attach_eb_folio_to_filemap(), so that everything is done with proper lock protection. Furthermore to prevent future problems, add an extra lockdep_assert_locked() to ensure we're holding the proper lock. To reproducer that is able to hit the race (takes a few minutes with instrumented code inserting delays to alloc_extent_buffer()): #!/bin/sh drop_caches () { while(true); do echo 3 > /proc/sys/vm/drop_caches echo 1 > /proc/sys/vm/compact_memory done } run_tar () { while(true); do for x in `seq 1 80` ; do tar cf /dev/zero /mnt > /dev/null & done wait done } mkfs.btrfs -f -d single -m single /dev/vda mount -o noatime /dev/vda /mnt # create 200,000 files, 1K each ./simoop -n 200000 -E -f 1k /mnt drop_caches & (run_tar) Reported-by: Linus Torvalds <[email protected]> Link: https://lore.kernel.org/linux-btrfs/CAHk-=wgt362nGfScVOOii8cgKn2LVVHeOvOA7OBwg1OwbuJQcw@mail.gmail.com/ Reported-by: Mikhail Gavrilov <[email protected]> Link: https://lore.kernel.org/lkml/CABXGCsPktcHQOvKTbPaTwegMExije=Gpgci5NW=hqORo-s7diA@mail.gmail.com/ Reported-by: Toralf Förster <[email protected]> Link: https://lore.kernel.org/linux-btrfs/[email protected]/ Reported-by: [email protected] Fixes: 09e6cef ("btrfs: refactor alloc_extent_buffer() to allocate-then-attach method") CC: [email protected] # 6.8+ CC: Chris Mason <[email protected]> Reviewed-by: Filipe Manana <[email protected]> Reviewed-by: Josef Bacik <[email protected]> Signed-off-by: Qu Wenruo <[email protected]> Reviewed-by: David Sterba <[email protected]> Signed-off-by: David Sterba <[email protected]>
1 parent fb33eb2 commit f3a5367

File tree

1 file changed

+31
-29
lines changed

1 file changed

+31
-29
lines changed

fs/btrfs/extent_io.c

+31-29
Original file line numberDiff line numberDiff line change
@@ -3689,6 +3689,8 @@ static struct extent_buffer *grab_extent_buffer(
36893689
struct folio *folio = page_folio(page);
36903690
struct extent_buffer *exists;
36913691

3692+
lockdep_assert_held(&page->mapping->i_private_lock);
3693+
36923694
/*
36933695
* For subpage case, we completely rely on radix tree to ensure we
36943696
* don't try to insert two ebs for the same bytenr. So here we always
@@ -3756,13 +3758,14 @@ static int check_eb_alignment(struct btrfs_fs_info *fs_info, u64 start)
37563758
* The caller needs to free the existing folios and retry using the same order.
37573759
*/
37583760
static int attach_eb_folio_to_filemap(struct extent_buffer *eb, int i,
3761+
struct btrfs_subpage *prealloc,
37593762
struct extent_buffer **found_eb_ret)
37603763
{
37613764

37623765
struct btrfs_fs_info *fs_info = eb->fs_info;
37633766
struct address_space *mapping = fs_info->btree_inode->i_mapping;
37643767
const unsigned long index = eb->start >> PAGE_SHIFT;
3765-
struct folio *existing_folio;
3768+
struct folio *existing_folio = NULL;
37663769
int ret;
37673770

37683771
ASSERT(found_eb_ret);
@@ -3774,12 +3777,14 @@ static int attach_eb_folio_to_filemap(struct extent_buffer *eb, int i,
37743777
ret = filemap_add_folio(mapping, eb->folios[i], index + i,
37753778
GFP_NOFS | __GFP_NOFAIL);
37763779
if (!ret)
3777-
return 0;
3780+
goto finish;
37783781

37793782
existing_folio = filemap_lock_folio(mapping, index + i);
37803783
/* The page cache only exists for a very short time, just retry. */
3781-
if (IS_ERR(existing_folio))
3784+
if (IS_ERR(existing_folio)) {
3785+
existing_folio = NULL;
37823786
goto retry;
3787+
}
37833788

37843789
/* For now, we should only have single-page folios for btree inode. */
37853790
ASSERT(folio_nr_pages(existing_folio) == 1);
@@ -3790,21 +3795,21 @@ static int attach_eb_folio_to_filemap(struct extent_buffer *eb, int i,
37903795
return -EAGAIN;
37913796
}
37923797

3793-
if (fs_info->nodesize < PAGE_SIZE) {
3794-
/*
3795-
* We're going to reuse the existing page, can drop our page
3796-
* and subpage structure now.
3797-
*/
3798+
finish:
3799+
spin_lock(&mapping->i_private_lock);
3800+
if (existing_folio && fs_info->nodesize < PAGE_SIZE) {
3801+
/* We're going to reuse the existing page, can drop our folio now. */
37983802
__free_page(folio_page(eb->folios[i], 0));
37993803
eb->folios[i] = existing_folio;
3800-
} else {
3804+
} else if (existing_folio) {
38013805
struct extent_buffer *existing_eb;
38023806

38033807
existing_eb = grab_extent_buffer(fs_info,
38043808
folio_page(existing_folio, 0));
38053809
if (existing_eb) {
38063810
/* The extent buffer still exists, we can use it directly. */
38073811
*found_eb_ret = existing_eb;
3812+
spin_unlock(&mapping->i_private_lock);
38083813
folio_unlock(existing_folio);
38093814
folio_put(existing_folio);
38103815
return 1;
@@ -3813,6 +3818,22 @@ static int attach_eb_folio_to_filemap(struct extent_buffer *eb, int i,
38133818
__free_page(folio_page(eb->folios[i], 0));
38143819
eb->folios[i] = existing_folio;
38153820
}
3821+
eb->folio_size = folio_size(eb->folios[i]);
3822+
eb->folio_shift = folio_shift(eb->folios[i]);
3823+
/* Should not fail, as we have preallocated the memory. */
3824+
ret = attach_extent_buffer_folio(eb, eb->folios[i], prealloc);
3825+
ASSERT(!ret);
3826+
/*
3827+
* To inform we have an extra eb under allocation, so that
3828+
* detach_extent_buffer_page() won't release the folio private when the
3829+
* eb hasn't been inserted into radix tree yet.
3830+
*
3831+
* The ref will be decreased when the eb releases the page, in
3832+
* detach_extent_buffer_page(). Thus needs no special handling in the
3833+
* error path.
3834+
*/
3835+
btrfs_folio_inc_eb_refs(fs_info, eb->folios[i]);
3836+
spin_unlock(&mapping->i_private_lock);
38163837
return 0;
38173838
}
38183839

@@ -3824,7 +3845,6 @@ struct extent_buffer *alloc_extent_buffer(struct btrfs_fs_info *fs_info,
38243845
int attached = 0;
38253846
struct extent_buffer *eb;
38263847
struct extent_buffer *existing_eb = NULL;
3827-
struct address_space *mapping = fs_info->btree_inode->i_mapping;
38283848
struct btrfs_subpage *prealloc = NULL;
38293849
u64 lockdep_owner = owner_root;
38303850
bool page_contig = true;
@@ -3890,7 +3910,7 @@ struct extent_buffer *alloc_extent_buffer(struct btrfs_fs_info *fs_info,
38903910
for (int i = 0; i < num_folios; i++) {
38913911
struct folio *folio;
38923912

3893-
ret = attach_eb_folio_to_filemap(eb, i, &existing_eb);
3913+
ret = attach_eb_folio_to_filemap(eb, i, prealloc, &existing_eb);
38943914
if (ret > 0) {
38953915
ASSERT(existing_eb);
38963916
goto out;
@@ -3927,24 +3947,6 @@ struct extent_buffer *alloc_extent_buffer(struct btrfs_fs_info *fs_info,
39273947
* and free the allocated page.
39283948
*/
39293949
folio = eb->folios[i];
3930-
eb->folio_size = folio_size(folio);
3931-
eb->folio_shift = folio_shift(folio);
3932-
spin_lock(&mapping->i_private_lock);
3933-
/* Should not fail, as we have preallocated the memory */
3934-
ret = attach_extent_buffer_folio(eb, folio, prealloc);
3935-
ASSERT(!ret);
3936-
/*
3937-
* To inform we have extra eb under allocation, so that
3938-
* detach_extent_buffer_page() won't release the folio private
3939-
* when the eb hasn't yet been inserted into radix tree.
3940-
*
3941-
* The ref will be decreased when the eb released the page, in
3942-
* detach_extent_buffer_page().
3943-
* Thus needs no special handling in error path.
3944-
*/
3945-
btrfs_folio_inc_eb_refs(fs_info, folio);
3946-
spin_unlock(&mapping->i_private_lock);
3947-
39483950
WARN_ON(btrfs_folio_test_dirty(fs_info, folio, eb->start, eb->len));
39493951

39503952
/*

0 commit comments

Comments
 (0)