
Commit 691ee97

Ryan Roberts authored and akpm00 committed
mm: fix lazy mmu docs and usage
Patch series "Fix lazy mmu mode", v2. I'm planning to implement lazy mmu mode for arm64 to optimize vmalloc. As part of that, I will extend lazy mmu mode to cover kernel mappings in vmalloc table walkers. While lazy mmu mode is already used for kernel mappings in a few places, this will extend it's use significantly. Having reviewed the existing lazy mmu implementations in powerpc, sparc and x86, it looks like there are a bunch of bugs, some of which may be more likely to trigger once I extend the use of lazy mmu. So this series attempts to clarify the requirements and fix all the bugs in advance of that series. See patch #1 commit log for all the details. This patch (of 5): The docs, implementations and use of arch_[enter|leave]_lazy_mmu_mode() is a bit of a mess (to put it politely). There are a number of issues related to nesting of lazy mmu regions and confusion over whether the task, when in a lazy mmu region, is preemptible or not. Fix all the issues relating to the core-mm. Follow up commits will fix the arch-specific implementations. 3 arches implement lazy mmu; powerpc, sparc and x86. When arch_[enter|leave]_lazy_mmu_mode() was first introduced by commit 6606c3e ("[PATCH] paravirt: lazy mmu mode hooks.patch"), it was expected that lazy mmu regions would never nest and that the appropriate page table lock(s) would be held while in the region, thus ensuring the region is non-preemptible. Additionally lazy mmu regions were only used during manipulation of user mappings. Commit 38e0edb ("mm/apply_to_range: call pte function with lazy updates") started invoking the lazy mmu mode in apply_to_pte_range(), which is used for both user and kernel mappings. For kernel mappings the region is no longer protected by any lock so there is no longer any guarantee about non-preemptibility. Additionally, for RT configs, the holding the PTL only implies no CPU migration, it doesn't prevent preemption. Commit bcc6cc8 ("mm: add default definition of set_ptes()") added arch_[enter|leave]_lazy_mmu_mode() to the default implementation of set_ptes(), used by x86. So after this commit, lazy mmu regions can be nested. Additionally commit 1a10a44 ("sparc64: implement the new page table range API") and commit 9fee28b ("powerpc: implement the new page table range API") did the same for the sparc and powerpc set_ptes() overrides. powerpc couldn't deal with preemption so avoids it in commit b9ef323 ("powerpc/64s: Disable preemption in hash lazy mmu mode"), which explicitly disables preemption for the whole region in its implementation. x86 can support preemption (or at least it could until it tried to add support nesting; more on this below). Sparc looks to be totally broken in the face of preemption, as far as I can tell. powerpc can't deal with nesting, so avoids it in commit 47b8def ("powerpc/mm: Avoid calling arch_enter/leave_lazy_mmu() in set_ptes"), which removes the lazy mmu calls from its implementation of set_ptes(). x86 attempted to support nesting in commit 49147be ("x86/xen: allow nesting of same lazy mode") but as far as I can tell, this breaks its support for preemption. In short, it's all a mess; the semantics for arch_[enter|leave]_lazy_mmu_mode() are not clearly defined and as a result the implementations all have different expectations, sticking plasters and bugs. arm64 is aiming to start using these hooks, so let's clean everything up before adding an arm64 implementation. 
Update the documentation to state that lazy mmu regions can never be nested, must not be called in interrupt context, and preemption may or may not be enabled for the duration of the region. Also fix the generic implementation of set_ptes() to avoid nesting. Arch-specific fixes to conform to the new spec will follow this one.

These issues were spotted by code review and I have no evidence of issues being reported in the wild.

Link: https://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/[email protected]
Fixes: bcc6cc8 ("mm: add default definition of set_ptes()")
Signed-off-by: Ryan Roberts <[email protected]>
Acked-by: David Hildenbrand <[email protected]>
Acked-by: Juergen Gross <[email protected]>
Cc: Andreas Larsson <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: Boris Ostrovsky <[email protected]>
Cc: Catalin Marinas <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: David S. Miller <[email protected]>
Cc: "H. Peter Anvin" <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Juergen Gross <[email protected]>
Cc: Matthew Wilcox (Oracle) <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
1 parent b243d66 commit 691ee97

File tree

1 file changed: +8 −6 lines changed


include/linux/pgtable.h

Lines changed: 8 additions & 6 deletions
@@ -222,10 +222,14 @@ static inline int pmd_dirty(pmd_t pmd)
  * hazard could result in the direct mode hypervisor case, since the actual
  * write to the page tables may not yet have taken place, so reads though
  * a raw PTE pointer after it has been modified are not guaranteed to be
- * up to date. This mode can only be entered and left under the protection of
- * the page table locks for all page tables which may be modified. In the UP
- * case, this is required so that preemption is disabled, and in the SMP case,
- * it must synchronize the delayed page table writes properly on other CPUs.
+ * up to date.
+ *
+ * In the general case, no lock is guaranteed to be held between entry and exit
+ * of the lazy mode. So the implementation must assume preemption may be enabled
+ * and cpu migration is possible; it must take steps to be robust against this.
+ * (In practice, for user PTE updates, the appropriate page table lock(s) are
+ * held, but for kernel PTE updates, no lock is held). Nesting is not permitted
+ * and the mode cannot be used in interrupt context.
  */
 #ifndef __HAVE_ARCH_ENTER_LAZY_MMU_MODE
 #define arch_enter_lazy_mmu_mode() do {} while (0)
@@ -287,15 +291,13 @@ static inline void set_ptes(struct mm_struct *mm, unsigned long addr,
 {
 	page_table_check_ptes_set(mm, ptep, pte, nr);
 
-	arch_enter_lazy_mmu_mode();
 	for (;;) {
 		set_pte(ptep, pte);
 		if (--nr == 0)
 			break;
 		ptep++;
 		pte = pte_next_pfn(pte);
 	}
-	arch_leave_lazy_mmu_mode();
 }
 #endif
 #define set_pte_at(mm, addr, ptep, pte) set_ptes(mm, addr, ptep, pte, 1)
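With the enter/leave calls removed from the generic set_ptes(), any lazy batching is expected to happen in a single region opened by the caller. A minimal sketch of that usage pattern, assuming a user mapping whose page table lock is already held (the wrapper name below is made up for illustration):

/*
 * Illustrative sketch only: one non-nested lazy mmu region around a batch of
 * PTE updates, entered with the page table lock already held and never from
 * interrupt context, per the updated documentation above.
 */
static void sketch_install_ptes(struct mm_struct *mm, unsigned long addr,
				pte_t *ptep, pte_t pte, unsigned int nr)
{
	arch_enter_lazy_mmu_mode();		/* single region, no nesting */
	set_ptes(mm, addr, ptep, pte, nr);	/* no longer enters lazy mode itself */
	arch_leave_lazy_mmu_mode();		/* deferred PTE writes are flushed here */
}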
