
Commit 2b35c1a

arkamar authored and gregkh committed
mm: fix folio_pte_batch() on XEN PV
commit 7b08b74 upstream.

On XEN PV, folio_pte_batch() can incorrectly batch beyond the end of a
folio due to a corner case in pte_advance_pfn(). Specifically, when the
PFN following the folio maps to an invalidated MFN,

	expected_pte = pte_advance_pfn(expected_pte, nr);

produces a pte_none(). If the actual next PTE in memory is also
pte_none(), the pte_same() succeeds,

	if (!pte_same(pte, expected_pte))
		break;

the loop is not broken, and batching continues into unrelated memory.

For example, with a 4-page folio, the PTE layout might look like this:

	[   53.465673] [ T2552] folio_pte_batch: printing PTE values at addr=0x7f1ac9dc5000
	[   53.465674] [ T2552]   PTE[453] = 000000010085c125
	[   53.465679] [ T2552]   PTE[454] = 000000010085d125
	[   53.465682] [ T2552]   PTE[455] = 000000010085e125
	[   53.465684] [ T2552]   PTE[456] = 000000010085f125
	[   53.465686] [ T2552]   PTE[457] = 0000000000000000 <-- not present
	[   53.465689] [ T2552]   PTE[458] = 0000000101da7125

pte_advance_pfn(PTE[456]) returns a pte_none() due to the invalid
PFN->MFN mapping. The next actual PTE (PTE[457]) is also pte_none(), so
the loop continues and includes PTE[457] in the batch, resulting in 5
batched entries for a 4-page folio. This triggers the following warning:

	[   53.465751] [ T2552] page: refcount:85 mapcount:20 mapping:ffff88813ff4f6a8 index:0x110 pfn:0x10085c
	[   53.465754] [ T2552] head: order:2 mapcount:80 entire_mapcount:0 nr_pages_mapped:4 pincount:0
	[   53.465756] [ T2552] memcg:ffff888003573000
	[   53.465758] [ T2552] aops:0xffffffff8226fd20 ino:82467c dentry name(?):"libc.so.6"
	[   53.465761] [ T2552] flags: 0x2000000000416c(referenced|uptodate|lru|active|private|head|node=0|zone=2)
	[   53.465764] [ T2552] raw: 002000000000416c ffffea0004021f08 ffffea0004021908 ffff88813ff4f6a8
	[   53.465767] [ T2552] raw: 0000000000000110 ffff888133d8bd40 0000005500000013 ffff888003573000
	[   53.465768] [ T2552] head: 002000000000416c ffffea0004021f08 ffffea0004021908 ffff88813ff4f6a8
	[   53.465770] [ T2552] head: 0000000000000110 ffff888133d8bd40 0000005500000013 ffff888003573000
	[   53.465772] [ T2552] head: 0020000000000202 ffffea0004021701 000000040000004f 00000000ffffffff
	[   53.465774] [ T2552] head: 0000000300000003 8000000300000002 0000000000000013 0000000000000004
	[   53.465775] [ T2552] page dumped because: VM_WARN_ON_FOLIO((_Generic((page + nr_pages - 1), const struct page *: (const struct folio *)_compound_head(page + nr_pages - 1), struct page *: (struct folio *)_compound_head(page + nr_pages - 1))) != folio)

The original code works as expected everywhere except on XEN PV, where
pte_advance_pfn() can yield a pte_none() after balloon inflation due to
MFN invalidation. In XEN, pte_advance_pfn() ends up calling
__pte() -> xen_make_pte() -> pte_pfn_to_mfn(), which returns pte_none()
when mfn == INVALID_P2M_ENTRY.

The pte_pfn_to_mfn() comment documents that nastiness:

	If there's no mfn for the pfn, then just create an empty
	non-present pte. Unfortunately this loses information about the
	original pfn, so pte_mfn_to_pfn is asymmetric.

While such hacks should certainly be removed, we can do better in
folio_pte_batch() and simply check ahead of time how many PTEs we can
possibly batch in our folio. This way, we can not only fix the issue
but also clean up the code: removing the pte_pfn() check inside the
loop body and avoiding the end_ptep comparison + arithmetic.
Link: https://lkml.kernel.org/r/[email protected]
Fixes: f8d9377 ("mm/memory: optimize fork() with PTE-mapped THP")
Co-developed-by: David Hildenbrand <[email protected]>
Signed-off-by: David Hildenbrand <[email protected]>
Signed-off-by: Petr Vaněk <[email protected]>
Cc: Ryan Roberts <[email protected]>
Cc: <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>
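To make the corner case concrete, below is a minimal user-space sketch of the two loop shapes, as a simulation under stated assumptions rather than kernel code: PTEs are plain u64 values, batch hints and flag clearing are omitted, and the XEN PV quirk is modelled by making pte_advance_pfn() collapse any PFN outside the folio to an empty entry. The helper names (old_batch, new_batch) and the geometry constants are ours; the PTE values come from the log above.

/*
 * batch_demo.c - simulate the folio_pte_batch() overrun on XEN PV.
 * Build: cc -o batch_demo batch_demo.c && ./batch_demo
 */
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define PAGE_SHIFT      12
#define FOLIO_START_PFN 0x10085cULL	/* from the log above */
#define FOLIO_NR_PAGES  4ULL

typedef uint64_t pte_t;

static uint64_t pte_pfn(pte_t pte) { return pte >> PAGE_SHIFT; }
static bool pte_same(pte_t a, pte_t b) { return a == b; }

/*
 * Simulated XEN PV quirk: a PFN with no MFN backing collapses to an
 * empty (pte_none) entry, like xen_make_pte()/pte_pfn_to_mfn() do.
 */
static pte_t pte_advance_pfn(pte_t pte, int nr)
{
	uint64_t pfn = pte_pfn(pte) + nr;

	if (pfn >= FOLIO_START_PFN + FOLIO_NR_PAGES)
		return 0;
	return (pfn << PAGE_SHIFT) | (pte & ((1 << PAGE_SHIFT) - 1));
}

/*
 * Old shape: walk pointers; rely on pte_pfn(pte) to detect the folio
 * end. pte_pfn(pte_none()) == 0, so that guard never fires and the
 * pte_same(0, 0) match lets the batch overrun.
 */
static int old_batch(const pte_t *start_ptep, int max_nr)
{
	uint64_t folio_end_pfn = FOLIO_START_PFN + FOLIO_NR_PAGES;
	const pte_t *end_ptep = start_ptep + max_nr;
	pte_t expected_pte = pte_advance_pfn(start_ptep[0], 1);
	const pte_t *ptep = start_ptep + 1;

	while (ptep < end_ptep) {
		if (!pte_same(*ptep, expected_pte))
			break;
		if (pte_pfn(*ptep) >= folio_end_pfn)
			break;
		expected_pte = pte_advance_pfn(expected_pte, 1);
		ptep++;
	}
	return (int)(ptep - start_ptep);
}

/*
 * Fixed shape: clamp max_nr to the PFNs actually left in the folio
 * before the loop, then simply count.
 */
static int new_batch(const pte_t *start_ptep, int max_nr)
{
	uint64_t left = FOLIO_START_PFN + FOLIO_NR_PAGES -
			pte_pfn(start_ptep[0]);
	pte_t expected_pte = pte_advance_pfn(start_ptep[0], 1);
	int nr = 1;

	if ((uint64_t)max_nr > left)
		max_nr = (int)left;

	while (nr < max_nr) {
		if (!pte_same(start_ptep[nr], expected_pte))
			break;
		expected_pte = pte_advance_pfn(expected_pte, 1);
		nr++;
	}
	return nr;
}

int main(void)
{
	/* PTE[453..458] from the commit log; PTE[457] is not present. */
	pte_t ptes[] = {
		0x10085c125, 0x10085d125, 0x10085e125, 0x10085f125,
		0x000000000, 0x101da7125,
	};

	printf("old loop batched %d PTEs for a %llu-page folio\n",
	       old_batch(ptes, 6), FOLIO_NR_PAGES);
	printf("new loop batched %d PTEs for a %llu-page folio\n",
	       new_batch(ptes, 6), FOLIO_NR_PAGES);
	return 0;
}

Run as written, this should report a 5-PTE batch from the old loop and a 4-PTE batch from the fixed one, matching the overrun described in the commit message.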
1 parent 399ec9c · commit 2b35c1a

File tree

1 file changed: +11 −16 lines

mm/internal.h

Lines changed: 11 additions & 16 deletions
@@ -205,11 +205,9 @@ static inline int folio_pte_batch(struct folio *folio, unsigned long addr,
 		pte_t *start_ptep, pte_t pte, int max_nr, fpb_t flags,
 		bool *any_writable, bool *any_young, bool *any_dirty)
 {
-	unsigned long folio_end_pfn = folio_pfn(folio) + folio_nr_pages(folio);
-	const pte_t *end_ptep = start_ptep + max_nr;
 	pte_t expected_pte, *ptep;
 	bool writable, young, dirty;
-	int nr;
+	int nr, cur_nr;
 
 	if (any_writable)
 		*any_writable = false;
@@ -222,11 +220,15 @@ static inline int folio_pte_batch(struct folio *folio, unsigned long addr,
 	VM_WARN_ON_FOLIO(!folio_test_large(folio) || max_nr < 1, folio);
 	VM_WARN_ON_FOLIO(page_folio(pfn_to_page(pte_pfn(pte))) != folio, folio);
 
+	/* Limit max_nr to the actual remaining PFNs in the folio we could batch. */
+	max_nr = min_t(unsigned long, max_nr,
+		       folio_pfn(folio) + folio_nr_pages(folio) - pte_pfn(pte));
+
 	nr = pte_batch_hint(start_ptep, pte);
 	expected_pte = __pte_batch_clear_ignored(pte_advance_pfn(pte, nr), flags);
 	ptep = start_ptep + nr;
 
-	while (ptep < end_ptep) {
+	while (nr < max_nr) {
 		pte = ptep_get(ptep);
 		if (any_writable)
 			writable = !!pte_write(pte);
@@ -239,27 +241,20 @@ static inline int folio_pte_batch(struct folio *folio, unsigned long addr,
 		if (!pte_same(pte, expected_pte))
 			break;
 
-		/*
-		 * Stop immediately once we reached the end of the folio. In
-		 * corner cases the next PFN might fall into a different
-		 * folio.
-		 */
-		if (pte_pfn(pte) >= folio_end_pfn)
-			break;
-
 		if (any_writable)
 			*any_writable |= writable;
 		if (any_young)
 			*any_young |= young;
 		if (any_dirty)
 			*any_dirty |= dirty;
 
-		nr = pte_batch_hint(ptep, pte);
-		expected_pte = pte_advance_pfn(expected_pte, nr);
-		ptep += nr;
+		cur_nr = pte_batch_hint(ptep, pte);
+		expected_pte = pte_advance_pfn(expected_pte, cur_nr);
+		ptep += cur_nr;
+		nr += cur_nr;
 	}
 
-	return min(ptep - start_ptep, max_nr);
+	return min(nr, max_nr);
 }
 
 /**
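As a worked example of the new clamp (our arithmetic, applied to the log above, not part of the patch): with folio_pfn(folio) == 0x10085c, folio_nr_pages(folio) == 4, and pte_pfn(pte) == 0x10085c for the first PTE of the batch,

	max_nr = min_t(unsigned long, max_nr, 0x10085c + 4 - 0x10085c)
	       = min(max_nr, 4)

so the while (nr < max_nr) loop can never examine PTE[457], regardless of whether pte_advance_pfn() collapses to pte_none(). Counting nr directly also lets the return statement drop the ptep - start_ptep pointer arithmetic.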
