Skip to content

Commit b87537d

Browse files
Hugh Dickinstorvalds
Hugh Dickins
authored andcommitted
mm: rmap use pte lock not mmap_sem to set PageMlocked
KernelThreadSanitizer (ktsan) has shown that the down_read_trylock() of mmap_sem in try_to_unmap_one() (when going to set PageMlocked on a page found mapped in a VM_LOCKED vma) is ineffective against races with exit_mmap()'s munlock_vma_pages_all(), because mmap_sem is not held when tearing down an mm. But that's okay, those races are benign; and although we've believed for years in that ugly down_read_trylock(), it's unsuitable for the job, and frustrates the good intention of setting PageMlocked when it fails. It just doesn't matter if here we read vm_flags an instant before or after a racing mlock() or munlock() or exit_mmap() sets or clears VM_LOCKED: the syscalls (or exit) work their way up the address space (taking pt locks after updating vm_flags) to establish the final state. We do still need to be careful never to mark a page Mlocked (hence unevictable) by any race that will not be corrected shortly after. The page lock protects from many of the races, but not all (a page is not necessarily locked when it's unmapped). But the pte lock we just dropped is good to cover the rest (and serializes even with munlock_vma_pages_all(), so no special barriers required): now hold on to the pte lock while calling mlock_vma_page(). Is that lock ordering safe? Yes, that's how follow_page_pte() calls it, and how page_remove_rmap() calls the complementary clear_page_mlock(). This fixes the following case (though not a case which anyone has complained of), which mmap_sem did not: truncation's preliminary unmap_mapping_range() is supposed to remove even the anonymous COWs of filecache pages, and that might race with try_to_unmap_one() on a VM_LOCKED vma, so that mlock_vma_page() sets PageMlocked just after zap_pte_range() unmaps the page, causing "Bad page state (mlocked)" when freed. The pte lock protects against this. You could say that it also protects against the more ordinary case, racing with the preliminary unmapping of a filecache page itself: but in our current tree, that's independently protected by i_mmap_rwsem; and that race would be why "Bad page state (mlocked)" was seen before commit 48ec833 ("Revert mm/memory.c: share the i_mmap_rwsem"). Vlastimil Babka points out another race which this patch protects against. try_to_unmap_one() might reach its mlock_vma_page() TestSetPageMlocked a moment after munlock_vma_pages_all() did its Phase 1 TestClearPageMlocked: leaving PageMlocked and unevictable when it should be evictable. mmap_sem is ineffective because exit_mmap() does not hold it; page lock ineffective because __munlock_pagevec() only takes it afterwards, in Phase 2; pte lock is effective because __munlock_pagevec_fill() takes it to get the page, after VM_LOCKED was cleared from vm_flags, so visible to try_to_unmap_one. Kirill Shutemov points out that if the compiler chooses to implement a "vma->vm_flags &= VM_WHATEVER" or "vma->vm_flags |= VM_WHATEVER" operation with an intermediate store of unrelated bits set, since I'm here foregoing its usual protection by mmap_sem, try_to_unmap_one() might catch sight of a spurious VM_LOCKED in vm_flags, and make the wrong decision. This does not appear to be an immediate problem, but we may want to define vm_flags accessors in future, to guard against such a possibility. While we're here, make a related optimization in try_to_munmap_one(): if it's doing TTU_MUNLOCK, then there's no point at all in descending the page tables and getting the pt lock, unless the vma is VM_LOCKED. Yes, that can change racily, but it can change racily even without the optimization: it's not critical. Far better not to waste time here. Stopped short of separating try_to_munlock_one() from try_to_munmap_one() on this occasion, but that's probably the sensible next step - with a rename, given that try_to_munlock()'s business is to try to set Mlocked. Updated the unevictable-lru Documentation, to remove its reference to mmap semaphore, but found a few more updates needed in just that area. Signed-off-by: Hugh Dickins <[email protected]> Cc: Christoph Lameter <[email protected]> Cc: "Kirill A. Shutemov" <[email protected]> Cc: Rik van Riel <[email protected]> Acked-by: Vlastimil Babka <[email protected]> Cc: Davidlohr Bueso <[email protected]> Cc: Oleg Nesterov <[email protected]> Cc: Sasha Levin <[email protected]> Cc: Dmitry Vyukov <[email protected]> Cc: KOSAKI Motohiro <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
1 parent 7a14239 commit b87537d

File tree

2 files changed

+27
-70
lines changed

2 files changed

+27
-70
lines changed

Documentation/vm/unevictable-lru.txt

Lines changed: 16 additions & 45 deletions
Original file line numberDiff line numberDiff line change
@@ -531,37 +531,20 @@ map.
531531

532532
try_to_unmap() is always called, by either vmscan for reclaim or for page
533533
migration, with the argument page locked and isolated from the LRU. Separate
534-
functions handle anonymous and mapped file pages, as these types of pages have
535-
different reverse map mechanisms.
534+
functions handle anonymous and mapped file and KSM pages, as these types of
535+
pages have different reverse map lookup mechanisms, with different locking.
536+
In each case, whether rmap_walk_anon() or rmap_walk_file() or rmap_walk_ksm(),
537+
it will call try_to_unmap_one() for every VMA which might contain the page.
536538

537-
(*) try_to_unmap_anon()
539+
When trying to reclaim, if try_to_unmap_one() finds the page in a VM_LOCKED
540+
VMA, it will then mlock the page via mlock_vma_page() instead of unmapping it,
541+
and return SWAP_MLOCK to indicate that the page is unevictable: and the scan
542+
stops there.
538543

539-
To unmap anonymous pages, each VMA in the list anchored in the anon_vma
540-
must be visited - at least until a VM_LOCKED VMA is encountered. If the
541-
page is being unmapped for migration, VM_LOCKED VMAs do not stop the
542-
process because mlocked pages are migratable. However, for reclaim, if
543-
the page is mapped into a VM_LOCKED VMA, the scan stops.
544-
545-
try_to_unmap_anon() attempts to acquire in read mode the mmap semaphore of
546-
the mm_struct to which the VMA belongs. If this is successful, it will
547-
mlock the page via mlock_vma_page() - we wouldn't have gotten to
548-
try_to_unmap_anon() if the page were already mlocked - and will return
549-
SWAP_MLOCK, indicating that the page is unevictable.
550-
551-
If the mmap semaphore cannot be acquired, we are not sure whether the page
552-
is really unevictable or not. In this case, try_to_unmap_anon() will
553-
return SWAP_AGAIN.
554-
555-
(*) try_to_unmap_file()
556-
557-
Unmapping of a mapped file page works the same as for anonymous mappings,
558-
except that the scan visits all VMAs that map the page's index/page offset
559-
in the page's mapping's reverse map interval search tree.
560-
561-
As for anonymous pages, on encountering a VM_LOCKED VMA for a mapped file
562-
page, try_to_unmap_file() will attempt to acquire the associated
563-
mm_struct's mmap semaphore to mlock the page, returning SWAP_MLOCK if this
564-
is successful, and SWAP_AGAIN, if not.
544+
mlock_vma_page() is called while holding the page table's lock (in addition
545+
to the page lock, and the rmap lock): to serialize against concurrent mlock or
546+
munlock or munmap system calls, mm teardown (munlock_vma_pages_all), reclaim,
547+
holepunching, and truncation of file pages and their anonymous COWed pages.
565548

566549

567550
try_to_munlock() REVERSE MAP SCAN
@@ -577,22 +560,15 @@ all PTEs from the page. For this purpose, the unevictable/mlock infrastructure
577560
introduced a variant of try_to_unmap() called try_to_munlock().
578561

579562
try_to_munlock() calls the same functions as try_to_unmap() for anonymous and
580-
mapped file pages with an additional argument specifying unlock versus unmap
563+
mapped file and KSM pages with a flag argument specifying unlock versus unmap
581564
processing. Again, these functions walk the respective reverse maps looking
582565
for VM_LOCKED VMAs. When such a VMA is found, as in the try_to_unmap() case,
583-
the functions attempt to acquire the associated mmap semaphore, mlock the page
584-
via mlock_vma_page() and return SWAP_MLOCK. This effectively undoes the
585-
pre-clearing of the page's PG_mlocked done by munlock_vma_page.
586-
587-
If try_to_unmap() is unable to acquire a VM_LOCKED VMA's associated mmap
588-
semaphore, it will return SWAP_AGAIN. This will allow shrink_page_list() to
589-
recycle the page on the inactive list and hope that it has better luck with the
590-
page next time.
566+
the functions mlock the page via mlock_vma_page() and return SWAP_MLOCK. This
567+
undoes the pre-clearing of the page's PG_mlocked done by munlock_vma_page.
591568

592569
Note that try_to_munlock()'s reverse map walk must visit every VMA in a page's
593570
reverse map to determine that a page is NOT mapped into any VM_LOCKED VMA.
594-
However, the scan can terminate when it encounters a VM_LOCKED VMA and can
595-
successfully acquire the VMA's mmap semaphore for read and mlock the page.
571+
However, the scan can terminate when it encounters a VM_LOCKED VMA.
596572
Although try_to_munlock() might be called a great many times when munlocking a
597573
large region or tearing down a large address space that has been mlocked via
598574
mlockall(), overall this is a fairly rare event.
@@ -620,11 +596,6 @@ Some examples of these unevictable pages on the LRU lists are:
620596
(3) mlocked pages that could not be isolated from the LRU and moved to the
621597
unevictable list in mlock_vma_page().
622598

623-
(4) Pages mapped into multiple VM_LOCKED VMAs, but try_to_munlock() couldn't
624-
acquire the VMA's mmap semaphore to test the flags and set PageMlocked.
625-
munlock_vma_page() was forced to let the page back on to the normal LRU
626-
list for vmscan to handle.
627-
628599
shrink_inactive_list() also diverts any unevictable pages that it finds on the
629600
inactive lists to the appropriate zone's unevictable list.
630601

mm/rmap.c

Lines changed: 11 additions & 25 deletions
Original file line numberDiff line numberDiff line change
@@ -1304,6 +1304,10 @@ static int try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
13041304
int ret = SWAP_AGAIN;
13051305
enum ttu_flags flags = (enum ttu_flags)arg;
13061306

1307+
/* munlock has nothing to gain from examining un-locked vmas */
1308+
if ((flags & TTU_MUNLOCK) && !(vma->vm_flags & VM_LOCKED))
1309+
goto out;
1310+
13071311
pte = page_check_address(page, mm, address, &ptl, 0);
13081312
if (!pte)
13091313
goto out;
@@ -1314,9 +1318,12 @@ static int try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
13141318
* skipped over this mm) then we should reactivate it.
13151319
*/
13161320
if (!(flags & TTU_IGNORE_MLOCK)) {
1317-
if (vma->vm_flags & VM_LOCKED)
1318-
goto out_mlock;
1319-
1321+
if (vma->vm_flags & VM_LOCKED) {
1322+
/* Holding pte lock, we do *not* need mmap_sem here */
1323+
mlock_vma_page(page);
1324+
ret = SWAP_MLOCK;
1325+
goto out_unmap;
1326+
}
13201327
if (flags & TTU_MUNLOCK)
13211328
goto out_unmap;
13221329
}
@@ -1421,31 +1428,10 @@ static int try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
14211428

14221429
out_unmap:
14231430
pte_unmap_unlock(pte, ptl);
1424-
if (ret != SWAP_FAIL && !(flags & TTU_MUNLOCK))
1431+
if (ret != SWAP_FAIL && ret != SWAP_MLOCK && !(flags & TTU_MUNLOCK))
14251432
mmu_notifier_invalidate_page(mm, address);
14261433
out:
14271434
return ret;
1428-
1429-
out_mlock:
1430-
pte_unmap_unlock(pte, ptl);
1431-
1432-
1433-
/*
1434-
* We need mmap_sem locking, Otherwise VM_LOCKED check makes
1435-
* unstable result and race. Plus, We can't wait here because
1436-
* we now hold anon_vma->rwsem or mapping->i_mmap_rwsem.
1437-
* if trylock failed, the page remain in evictable lru and later
1438-
* vmscan could retry to move the page to unevictable lru if the
1439-
* page is actually mlocked.
1440-
*/
1441-
if (down_read_trylock(&vma->vm_mm->mmap_sem)) {
1442-
if (vma->vm_flags & VM_LOCKED) {
1443-
mlock_vma_page(page);
1444-
ret = SWAP_MLOCK;
1445-
}
1446-
up_read(&vma->vm_mm->mmap_sem);
1447-
}
1448-
return ret;
14491435
}
14501436

14511437
bool is_vma_temporary_stack(struct vm_area_struct *vma)

0 commit comments

Comments
 (0)