
Enabling CONFIG_TIMESLICING and CONFIG_TICKLESS_KERNEL at the same time makes thread switching much slower #88353


Open

fkokosinski opened this issue Apr 9, 2025 · 4 comments · May be fixed by #89426

Labels: area: Kernel · bug (The issue is a bug, or the PR is fixing a bug) · priority: medium (Medium impact/importance bug)

Comments

@fkokosinski (Member)

Describe the bug
When CONFIG_TIMESLICING (with CONFIG_TIMESLICE_SIZE set to a non-zero value) and CONFIG_TICKLESS_KERNEL are enabled at the same time, the time required to switch threads is noticeably higher. In the outlier case, on qemu_x86_64 with KVM enabled and a single CPU, the test code ran >100 times slower after enabling tickless mode (without KVM and with the default two CPUs, enabling tickless mode made it ~2 times slower). On the mimxrt685_evk platform it was ~10 times slower, so the exact slowdown depends on the platform.

The timeslice size doesn't seem to affect the results in a noticeable way.

To Reproduce
Run the following code with and without CONFIG_TICKLESS_KERNEL using the config below. The sample creates THRD_C threads and THRD_C semaphores, and passes control to the next thread REPEATS times in a loop. At the end it prints the runtime.

CONFIG_TIMESLICING=y
# the timeslice size doesn't affect the runtime noticeably, it just has to be bigger than 0
CONFIG_TIMESLICE_SIZE=10
# enabling tickless mode makes it slower
CONFIG_TICKLESS_KERNEL=y

# in case of qemu_x86_64 with kvm, enabling just one core and disabling SMP makes the difference between tickless and non-tickless bigger
# CONFIG_SMP=n
# CONFIG_MP_MAX_NUM_CPUS=1

CONFIG_PICOLIBC_IO_FLOAT=y

#include <zephyr/kernel.h>
#include <stdio.h>

#define THRD_C 2
#define REPEATS 100000

int64_t start_time;
int c = 0;

struct k_thread thrds[THRD_C];
K_THREAD_STACK_ARRAY_DEFINE(thrd_stacks, THRD_C, 4096);
struct k_sem smphs[THRD_C];

int ids[THRD_C];

void thrd_func(void *arg, void *p2, void *p3)
{
	ARG_UNUSED(p2);
	ARG_UNUSED(p3);

	int id = *(int *)arg;

	while (1) {
		k_sem_take(&smphs[id], K_FOREVER);
		c++;
		if (c >= REPEATS){
			printf("time %f\n", (k_uptime_get() - start_time) / 1e3);
			return;
		}
		k_sem_give(&smphs[(id + 1) % THRD_C]);
	}
}

int main(void)
{
	for (int i = 0; i < THRD_C; i++){
		ids[i] = i;
		k_sem_init(&smphs[i], 0, 1);

		(void)k_thread_create(&thrds[i], thrd_stacks[i],
				K_THREAD_STACK_SIZEOF(thrd_stacks[i]), thrd_func, &ids[i], NULL,
				NULL, 0, 0, K_NO_WAIT);
	}

	start_time = k_uptime_get();
	k_sem_give(&smphs[0]);

	for (int i = 0; i < THRD_C; i++){
		k_thread_join(&thrds[i], K_FOREVER);
	}

	return 0;
}

The snippet was built and run using the standard west build -b <board_name> followed by west build -t run or west flash (depending on the platform type), except for qemu_x86_64 with a single CPU, which had to be launched manually due to a conflict between the -enable-kvm and -icount parameters:
qemu-system-x86_64 -nographic -m 32 -enable-kvm -device loader,file=build/zephyr/zephyr-qemu-main.elf -kernel build/zephyr/zephyr-qemu-locore.elf.

Measured runtimes

Platform                         Time w/ tickless   Time w/o tickless
qemu_x86_64 2cpu                 1.15 s             0.45 s
qemu_x86_64 2cpu kvm             0.27 s             0.14 s
qemu_x86_64 1cpu kvm             2.12 s             0.02 s
qemu_x86_64 1cpu no smp kvm      4.13 s             0.01 s
mimxrt685_evk/mimxrt685s/cm33    3.02 s             0.31 s

Expected behavior
Enabling tickless mode shouldn't impact execution time noticeably.

Impact
In one application we noticed delays of multiple seconds after starting worker threads when tickless mode and timeslicing were enabled at the same time.

Environment:

  • OS: Linux
  • Toolchain: Zephyr SDK 0.16.8
  • Commit SHA: c53fb67
fkokosinski added the area: Kernel and bug labels on Apr 9, 2025
@peter-mitsis (Collaborator)

This test seems to be designed to elicit the worst-case scenario of time-slicing. Each time a thread is switched out, the timeout associated with the current/old timeslice must be cancelled and a new one created. The test then hammers context switching via k_sem_give() and a blocking k_sem_take(). As a result, we wind up doing a lot of extra operations on the timeout queue.

With the above in mind, it would not surprise me if we got similar measurements if time-slicing was disabled, but we provided a finite timeout to k_sem_take().
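To make that hypothesis concrete, here is a hedged sketch of how the reproducer's thread loop could be modified to exercise the timeout queue with CONFIG_TIMESLICING=n: the K_FOREVER wait is replaced by a finite timeout. The K_MSEC(100) value and the retry-on-timeout handling are illustrative choices, not taken from the original report.

/* Variant of thrd_func()'s loop: with time slicing disabled, every
 * k_sem_take() below still arms a timeout (and cancels it on success),
 * so each context switch keeps adding/removing entries on the timeout
 * queue, reproducing the cost described above. */
while (1) {
	if (k_sem_take(&smphs[id], K_MSEC(100)) != 0) {
		continue; /* timed out: retry the wait */
	}
	c++;
	if (c >= REPEATS) {
		printf("time %f\n", (k_uptime_get() - start_time) / 1e3);
		return;
	}
	k_sem_give(&smphs[(id + 1) % THRD_C]);
}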

All this being said, I still plan to take a closer look to better gauge what might be done about this. (Incidentally, this is arguably more of an enhancement as opposed to a bug.)

nashif added the priority: medium (Medium impact/importance bug) label on Apr 14, 2025
@peter-mitsis (Collaborator)

Update:

  1. Using the disco_l475_iot1 board as a reference, we can get about a +4% performance boost simply by inlining the routine remove_timeout().
  2. As mentioned previously, the context switching is hammering the timeout aborting and setting, and I am looking into the possibility of bypassing some of that. As our documentation states, time slicing only guarantees the maximum amount of time before a thread yields to another of equal priority (if one exists). One interpretation of this is that we may not need to abort and reset the time slice timeout every time we switch in a new thread. That is, if the new thread can be sliced, then we may be able to bypass this costly step and piggyback on an existing time slice timeout (see the sketch after this list).
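A minimal, self-contained sketch of that second idea follows. It is not Zephyr kernel code: the names slice_state, arm_slice_timeout and slice_switch_in are hypothetical and exist only to illustrate re-using an already-armed slice timeout instead of aborting and re-arming it on every switch.

#include <stdbool.h>
#include <stdio.h>

/* Hypothetical model of per-CPU time-slice bookkeeping (not Zephyr code). */
struct slice_state {
	bool timeout_armed;  /* a slice timeout is already pending     */
	int rearm_count;     /* how many times we touched the hardware */
};

/* Stand-in for the expensive part: aborting/re-adding the timeout and
 * reprogramming the timer hardware. */
static void arm_slice_timeout(struct slice_state *s)
{
	s->timeout_armed = true;
	s->rearm_count++;
}

/* Called on every switch-in of a sliceable thread. The key idea: if a
 * slice timeout is already pending, piggyback on it -- the new thread may
 * get less than a full slice, which is still within the documented
 * "maximum time before yielding" guarantee. */
static void slice_switch_in(struct slice_state *s)
{
	if (!s->timeout_armed) {
		arm_slice_timeout(s);
	}
	/* else: reuse the pending timeout, skipping the costly abort/re-add */
}

int main(void)
{
	struct slice_state s = { 0 };

	/* Simulate many rapid switches between two threads, as in the
	 * reproducer: only the first switch pays for arming the timeout. */
	for (int i = 0; i < 1000; i++) {
		slice_switch_in(&s);
	}
	printf("timer reprogrammed %d time(s) for 1000 switches\n",
	       s.rearm_count);
	return 0;
}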

@fkokosinski (Member, Author)

Hey @peter-mitsis, thanks for taking the time to look into this!

With the above in mind, it would not surprise me if we got similar measurements if time-slicing was disabled, but we provided a finite timeout to k_sem_take().

Would this explain the difference we observed with CONFIG_TICKLESS_KERNEL enabled/disabled as well?

@peter-mitsis (Collaborator)

@fkokosinski - Thanks for drawing my attention back to the tickless aspect, as I was getting rather focused on the time slicing. Yes, I think this would explain the differences observed with CONFIG_TICKLESS_KERNEL enabled/disabled as well.

In the provided code sample, the only timeout expected to be present in the system is the one associated with the time slice. Consequently, when that timeout is added via z_add_timeout(), we are going to use the tickless kernel version of sys_clock_set_timeout() on each of those calls. This in turn accesses the system timer registers, which is often slow (as pointed out in recent PRs such as #87948).

(It is worth noting that the non-tickless version of sys_clock_set_timeout() is essentially a no-op.)
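A rough, self-contained illustration of that contrast follows. It is not the actual Zephyr timer-driver code: fake_compare_reg, reg_writes, model_set_timeout() and the loop counts are made up purely to show where the per-call cost goes (the real driver hook the kernel calls is sys_clock_set_timeout(int32_t ticks, bool idle)).

#include <stdint.h>
#include <stdio.h>

/* Stand-in for a memory-mapped timer compare register; in a real driver
 * this write is the slow system-register/MMIO access. */
static volatile uint32_t fake_compare_reg;
static unsigned int reg_writes;

/* Simplified model of the timeout-setting hook called from z_add_timeout(). */
static void model_set_timeout(int32_t ticks, int tickless)
{
	if (tickless) {
		/* Tickless: reprogram the comparator on every call. */
		fake_compare_reg = (uint32_t)ticks;
		reg_writes++;
	}
	/* Non-tickless: nothing to do; the periodic tick interrupt is
	 * already programmed, so the call is essentially a no-op. */
}

int main(void)
{
	/* Simulate the reproducer: one timeout set per context switch. */
	for (int i = 0; i < 200000; i++) {
		model_set_timeout(10, 1 /* tickless */);
	}
	printf("tickless: %u timer register writes\n", reg_writes);

	reg_writes = 0;
	for (int i = 0; i < 200000; i++) {
		model_set_timeout(10, 0 /* non-tickless */);
	}
	printf("non-tickless: %u timer register writes\n", reg_writes);
	return 0;
}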

kartben linked a pull request (#89426) on May 3, 2025 that will close this issue