Commit 37d0ec3

George Spelvin authored and torvalds committed
lib/sort: make swap functions more generic
Patch series "lib/sort & lib/list_sort: faster and smaller", v2.

Because CONFIG_RETPOLINE has made indirect calls much more expensive, I thought I'd try to reduce the number made by the library sort functions.

The first three patches apply to lib/sort.c.

Patch #1 is a simple optimization. The built-in swap has special cases for aligned 4- and 8-byte objects. But those are almost never used; most calls to sort() work on larger structures, which fall back to the byte-at-a-time loop. This generalizes them to aligned *multiples* of 4 and 8 bytes. (If nothing else, it saves an awful lot of energy by not thrashing the store buffers as much.)

Patch #2 grabs a juicy piece of low-hanging fruit. I agree that nice simple solid heapsort is preferable to more complex algorithms (sorry, Andrey), but it's possible to implement heapsort with far fewer comparisons (50% asymptotically, 25-40% reduction for realistic sizes) than the way it's been done up to now. And with some care, the code ends up smaller, as well. This is the "big win" patch.

Patch #3 adds the same sort of indirect call bypass that has been added to the net code of late. The great majority of the callers use the builtin swap functions, so replace the indirect call to sort_func with a (highly predictable) series of if() statements. Rather surprisingly, this decreased code size, as the swap functions were inlined and their prologue & epilogue code eliminated.

lib/list_sort.c is a bit trickier, as merge sort is already close to optimal, and we don't want to introduce triumphs of theory over practicality like the Ford-Johnson merge-insertion sort.

Patch #4, without changing the algorithm, chops 32% off the code size and removes the part[MAX_LIST_LENGTH+1] pointer array (and the corresponding upper limit on efficiently sortable input size).

Patch #5 improves the algorithm. The previous code is already optimal for power-of-two (or slightly smaller) size inputs, but when the input size is just over a power of 2, there's a very unbalanced final merge.

There are, in the literature, several algorithms which solve this, but they all depend on the "breadth-first" merge order which was replaced by commit 835cc0c with a more cache-friendly "depth-first" order. Some hard thinking came up with a depth-first algorithm which defers merges as little as possible while avoiding bad merges. This saves 0.2*n compares, averaged over all sizes.

The code size increase is minimal (64 bytes on x86-64, reducing the net savings to 26%), but the comments expanded significantly to document the clever algorithm.

TESTING NOTES: I have some ugly user-space benchmarking code which I used for testing before moving this code into the kernel. Shout if you want a copy.

I'm running this code right now, with CONFIG_TEST_SORT and CONFIG_TEST_LIST_SORT, but I confess I haven't rebooted since the last round of minor edits to quell checkpatch. I figure there will be at least one round of comments and final testing.

This patch (of 5):

Rather than having special-case swap functions for 4- and 8-byte objects, special-case aligned multiples of 4 or 8 bytes. This speeds up most users of sort() by avoiding fallback to the byte copy loop.

Despite what ca96ab8 ("lib/sort: Add 64 bit swap function") claims, very few users of sort() sort pointers (or pointer-sized objects); most sort structures containing at least two words. (E.g. drivers/acpi/fan.c:acpi_fan_get_fps() sorts an array of 40-byte struct acpi_fan_fps.)

The functions also got renamed to reflect the fact that they support multiple words. In the great tradition of bikeshedding, the names were by far the most contentious issue during review of this patch series.

x86-64 code size 872 -> 886 bytes (+14)

With feedback from Andy Shevchenko, Rasmus Villemoes and Geert Uytterhoeven.

Link: http://lkml.kernel.org/r/f24f932df3a7fa1973c1084154f1cea596bcf341.1552704200.git.lkml@sdf.org
Signed-off-by: George Spelvin <[email protected]>
Acked-by: Andrey Abramov <[email protected]>
Acked-by: Rasmus Villemoes <[email protected]>
Reviewed-by: Andy Shevchenko <[email protected]>
Cc: Rasmus Villemoes <[email protected]>
Cc: Geert Uytterhoeven <[email protected]>
Cc: Daniel Wagner <[email protected]>
Cc: Don Mullis <[email protected]>
Cc: Dave Chinner <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
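The alignment test described in this patch can be tried outside the kernel. The following is a minimal user-space sketch, not the kernel code itself: `can_copy_by_words()` and `pick_chunk_width()` are hypothetical names, and the sketch assumes the base address always has to be checked (as if CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS were not set). It ORs the low bits of the size and the address together so that a single mask test covers both.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical user-space analogue of the patch's is_aligned().
 * Assumption: the base address is always checked, i.e. unaligned
 * access is treated as not efficient.
 */
static bool can_copy_by_words(const void *base, size_t size, unsigned char align)
{
	/* Only the low bits matter for alignments up to 128. */
	unsigned char lsbits = (unsigned char)size;

	/* The mask trick below requires a power-of-two alignment. */
	assert(align != 0 && (align & (align - 1)) == 0);

	/* One combined test instead of "size & mask || base & mask". */
	lsbits |= (unsigned char)(uintptr_t)base;
	return (lsbits & (align - 1)) == 0;
}

/*
 * Hypothetical helper mirroring the selection order in sort():
 * try 8-byte chunks first, then 4-byte chunks, else single bytes.
 */
static unsigned int pick_chunk_width(const void *base, size_t size)
{
	if (can_copy_by_words(base, size, 8))
		return 8;
	if (can_copy_by_words(base, size, 4))
		return 4;
	return 1;
}
```

Under these assumptions, a 40-byte element in an 8-byte-aligned array (the struct acpi_fan_fps case mentioned above) selects 8-byte chunks, whereas the old `size == 8` test would have sent it to the byte loop.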
1 parent 8e18fae commit 37d0ec3

1 file changed: +99 -24 lines changed

lib/sort.c

Lines changed: 99 additions & 24 deletions
@@ -11,35 +11,108 @@
 #include <linux/export.h>
 #include <linux/sort.h>
 
-static int alignment_ok(const void *base, int align)
+/**
+ * is_aligned - is this pointer & size okay for word-wide copying?
+ * @base: pointer to data
+ * @size: size of each element
+ * @align: required alignment (typically 4 or 8)
+ *
+ * Returns true if elements can be copied using word loads and stores.
+ * The size must be a multiple of the alignment, and the base address must
+ * be if we do not have CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS.
+ *
+ * For some reason, gcc doesn't know to optimize "if (a & mask || b & mask)"
+ * to "if ((a | b) & mask)", so we do that by hand.
+ */
+__attribute_const__ __always_inline
+static bool is_aligned(const void *base, size_t size, unsigned char align)
 {
-	return IS_ENABLED(CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS) ||
-		((unsigned long)base & (align - 1)) == 0;
+	unsigned char lsbits = (unsigned char)size;
+
+	(void)base;
+#ifndef CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS
+	lsbits |= (unsigned char)(uintptr_t)base;
+#endif
+	return (lsbits & (align - 1)) == 0;
 }
 
-static void u32_swap(void *a, void *b, int size)
+/**
+ * swap_words_32 - swap two elements in 32-bit chunks
+ * @a, @b: pointers to the elements
+ * @size: element size (must be a multiple of 4)
+ *
+ * Exchange the two objects in memory. This exploits base+index addressing,
+ * which basically all CPUs have, to minimize loop overhead computations.
+ *
+ * For some reason, on x86 gcc 7.3.0 adds a redundant test of n at the
+ * bottom of the loop, even though the zero flag is still valid from the
+ * subtract (since the intervening mov instructions don't alter the flags).
+ * Gcc 8.1.0 doesn't have that problem.
+ */
+static void swap_words_32(void *a, void *b, int size)
 {
-	u32 t = *(u32 *)a;
-	*(u32 *)a = *(u32 *)b;
-	*(u32 *)b = t;
+	size_t n = (unsigned int)size;
+
+	do {
+		u32 t = *(u32 *)(a + (n -= 4));
+		*(u32 *)(a + n) = *(u32 *)(b + n);
+		*(u32 *)(b + n) = t;
+	} while (n);
 }
 
-static void u64_swap(void *a, void *b, int size)
+/**
+ * swap_words_64 - swap two elements in 64-bit chunks
+ * @a, @b: pointers to the elements
+ * @size: element size (must be a multiple of 8)
+ *
+ * Exchange the two objects in memory. This exploits base+index
+ * addressing, which basically all CPUs have, to minimize loop overhead
+ * computations.
+ *
+ * We'd like to use 64-bit loads if possible. If they're not, emulating
+ * one requires base+index+4 addressing which x86 has but most other
+ * processors do not. If CONFIG_64BIT, we definitely have 64-bit loads,
+ * but it's possible to have 64-bit loads without 64-bit pointers (e.g.
+ * x32 ABI). Are there any cases the kernel needs to worry about?
+ */
+static void swap_words_64(void *a, void *b, int size)
 {
-	u64 t = *(u64 *)a;
-	*(u64 *)a = *(u64 *)b;
-	*(u64 *)b = t;
+	size_t n = (unsigned int)size;
+
+	do {
+#ifdef CONFIG_64BIT
+		u64 t = *(u64 *)(a + (n -= 8));
+		*(u64 *)(a + n) = *(u64 *)(b + n);
+		*(u64 *)(b + n) = t;
+#else
+		/* Use two 32-bit transfers to avoid base+index+4 addressing */
+		u32 t = *(u32 *)(a + (n -= 4));
+		*(u32 *)(a + n) = *(u32 *)(b + n);
+		*(u32 *)(b + n) = t;
+
+		t = *(u32 *)(a + (n -= 4));
+		*(u32 *)(a + n) = *(u32 *)(b + n);
+		*(u32 *)(b + n) = t;
+#endif
+	} while (n);
 }
 
-static void generic_swap(void *a, void *b, int size)
+/**
+ * swap_bytes - swap two elements a byte at a time
+ * @a, @b: pointers to the elements
+ * @size: element size
+ *
+ * This is the fallback if alignment doesn't allow using larger chunks.
+ */
+static void swap_bytes(void *a, void *b, int size)
 {
-	char t;
+	size_t n = (unsigned int)size;
 
 	do {
-		t = *(char *)a;
-		*(char *)a++ = *(char *)b;
-		*(char *)b++ = t;
-	} while (--size > 0);
+		char t = ((char *)a)[--n];
+		((char *)a)[n] = ((char *)b)[n];
+		((char *)b)[n] = t;
+	} while (n);
 }
 
 /**
@@ -50,8 +123,10 @@ static void generic_swap(void *a, void *b, int size)
  * @cmp_func: pointer to comparison function
  * @swap_func: pointer to swap function or NULL
  *
- * This function does a heapsort on the given array. You may provide a
- * swap_func function optimized to your element type.
+ * This function does a heapsort on the given array. You may provide
+ * a swap_func function if you need to do something more than a memory
+ * copy (e.g. fix up pointers or auxiliary data), but the built-in swap
+ * isn't usually a bottleneck.
  *
  * Sorting time is O(n log n) both on average and worst-case. While
  * qsort is about 20% faster on average, it suffers from exploitable
@@ -67,12 +142,12 @@ void sort(void *base, size_t num, size_t size,
 	int i = (num/2 - 1) * size, n = num * size, c, r;
 
 	if (!swap_func) {
-		if (size == 4 && alignment_ok(base, 4))
-			swap_func = u32_swap;
-		else if (size == 8 && alignment_ok(base, 8))
-			swap_func = u64_swap;
+		if (is_aligned(base, size, 8))
+			swap_func = swap_words_64;
+		else if (is_aligned(base, size, 4))
+			swap_func = swap_words_32;
 		else
-			swap_func = generic_swap;
+			swap_func = swap_bytes;
 	}
 
 	/* heapify */
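The descending-offset loop used by swap_words_32() can be reproduced in portable user-space C. This is a sketch under stated assumptions rather than the kernel implementation: `swap_chunks_32()` is a hypothetical name, `memcpy()` stands in for the kernel's `u32` pointer casts (to stay within strict-aliasing rules outside the kernel), and `unsigned char *` replaces the GNU `void *` arithmetic. As in the patch, a single counter `n` serves as both the loop condition and the offset into both elements.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Hypothetical portable variant of the patch's swap_words_32().
 * Walks from the end of the elements toward the start, 4 bytes at a
 * time, using one shared offset for both operands.
 */
static void swap_chunks_32(void *a, void *b, size_t size)
{
	unsigned char *pa = a;
	unsigned char *pb = b;
	size_t n = size;

	/* Same precondition as the kernel helper: non-zero multiple of 4. */
	assert(n != 0 && n % 4 == 0);

	do {
		uint32_t ta, tb;

		n -= 4;
		memcpy(&ta, pa + n, sizeof(ta));	/* load chunk of a */
		memcpy(&tb, pb + n, sizeof(tb));	/* load chunk of b */
		memcpy(pa + n, &tb, sizeof(tb));	/* store b's chunk into a */
		memcpy(pb + n, &ta, sizeof(ta));	/* store a's chunk into b */
	} while (n);
}
```

Calling `swap_chunks_32(x, y, sizeof(x))` on two arrays of three `uint32_t` values exchanges their contents in three iterations, where a byte-at-a-time loop would need twelve.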
