Skip to content

display: Writing too fast to SDL display with native_sim causes display to stay blank forever #88818

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
VynDragon opened this issue Apr 19, 2025 · 20 comments · Fixed by #88886
Closed
Assignees
Labels
area: Display area: native port Host native arch port (native_sim) bug The issue is a bug, or the PR is fixing a bug priority: low Low impact/importance bug

Comments

@VynDragon
Copy link
Collaborator

VynDragon commented Apr 19, 2025

Describe the bug
Writing too fast to SDL display with native_sim causes display to stay blank forever.
This is not a issue on device with a normal display.

To Reproduce
Attached file contains a app that reproduces the bug.
Uncommenting the sleep at the start, or adding display_blanking_on(display_device); before set format fixes the bug (display goes black instead of staying blank forever)

Expected behavior
Display goes black without the sleep.

Impact
Annoyance, trap if unknown.

Environment (please complete the following information):

  • OS: Arch linux latest
  • Toolchain: Zephyr SDK 17.0 and native toolchain
  • Commit 3c2d2a6

zephyr_sdl_native_sim_race.tar.gz

@VynDragon VynDragon added the bug The issue is a bug, or the PR is fixing a bug label Apr 19, 2025
@VynDragon VynDragon changed the title display: Writing too fast to SDL display with native_sim causes display to stay blank display: Writing too fast to SDL display with native_sim causes display to stay blank forever Apr 19, 2025
@JarmouniA JarmouniA added area: native port Host native arch port (native_sim) area: Display labels Apr 19, 2025
@aescolar
Copy link
Member

@aescolar aescolar assigned faxe1008 and unassigned aescolar Apr 22, 2025
@faxe1008
Copy link
Collaborator

faxe1008 commented Apr 22, 2025

Applying this patch:

diff --git a/drivers/display/display_sdl.c b/drivers/display/display_sdl.c
index e24e6ea6692..41b4d35473c 100644
--- a/drivers/display/display_sdl.c
+++ b/drivers/display/display_sdl.c
@@ -116,6 +116,7 @@ static void sdl_task_thread(void *p1, void *p2, void *p3)
 		sdl_display_zoom_pct = CONFIG_SDL_DISPLAY_ZOOM_PCT;
 	}

+    LOG_ERR("SDL thread started, zoom %d", sdl_display_zoom_pct);
 	int rc = sdl_display_init_bottom(config->height, config->width, sdl_display_zoom_pct,
 					 use_accelerator, &disp_data->window, dev,
 					 &disp_data->renderer, &disp_data->mutex,
@@ -133,8 +134,11 @@ static void sdl_task_thread(void *p1, void *p2, void *p3)
 	disp_data->display_on = false;

 	while (1) {
+        LOG_ERR("SDL thread waiting for message");
 		k_msgq_get(disp_data->task_msgq, &task, K_FOREVER);
+        LOG_ERR("SDL thread message received");
 		exec_sdl_task(dev, &task);
+        LOG_ERR("SDL thread message executed");
 		k_sem_give(&disp_data->task_sem);
 	}
 }
@@ -373,6 +377,7 @@ static int sdl_display_write(const struct device *dev, const uint16_t x,
 		return -EINVAL;
 	}

+    LOG_ERR("Writing, take mutex");
 	k_mutex_lock(&disp_data->task_mutex, K_FOREVER);
 	if (disp_data->current_pixel_format == PIXEL_FORMAT_ARGB_8888) {
 		sdl_display_write_argb8888(disp_data->buf, desc, buf);
@@ -393,11 +398,15 @@ static int sdl_display_write(const struct device *dev, const uint16_t x,
 	if (k_current_get() == &disp_data->sdl_thread) {
 		exec_sdl_task(dev, &task);
 	} else {
+        LOG_ERR("Writing, put message");
 		k_msgq_put(disp_data->task_msgq, &task, K_FOREVER);
+        LOG_ERR("Writing, wait for semaphore");
 		k_sem_take(&disp_data->task_sem, K_FOREVER);
+        LOG_ERR("Writing, semaphore taken");
 	}

 	k_mutex_unlock(&disp_data->task_mutex);
+    LOG_DBG("Writing, mutex unlocked");

 	return 0;
 }
@@ -633,16 +642,21 @@ static int sdl_display_blanking_off(const struct device *dev)

 	LOG_DBG("Turning display blacking off");
 	disp_data->display_on = true;
+    LOG_ERR("Blanking off, mutex lock");
 	k_mutex_lock(&disp_data->task_mutex, K_FOREVER);

 	if (k_current_get() == &disp_data->sdl_thread) {
 		exec_sdl_task(dev, &task);
 	} else {
+        LOG_ERR("Blanking off, put into msgq");
 		k_msgq_put(disp_data->task_msgq, &task, K_FOREVER);
+        LOG_ERR("Blanking off, wait for sem");
 		k_sem_take(&disp_data->task_sem, K_FOREVER);
+        LOG_ERR("Blanking off, sem taken");
 	}

 	k_mutex_unlock(&disp_data->task_mutex);
+    LOG_ERR("Blanking off, mutex unlock");

 	return 0;
 }
@@ -705,7 +719,9 @@ static int sdl_display_set_pixel_format(const struct device *dev,
 	case PIXEL_FORMAT_RGB_565:
 	case PIXEL_FORMAT_BGR_565:
 	case PIXEL_FORMAT_L_8:
+		k_mutex_lock(&disp_data->task_mutex, K_FOREVER);
 		disp_data->current_pixel_format = pixel_format;
+        k_mutex_unlock(&disp_data->task_mutex);
 		return 0;
 	default:
 		LOG_ERR("Pixel format not supported");

And running without the sleep:

[00:00:00.000,000] <err> display_sdl: SDL thread started, zoom 100
[00:00:00.000,000] <err> display_sdl: SDL thread waiting for message
[00:00:01.010,000] <err> display_sdl: Blanking off, mutex lock
[00:00:01.010,000] <err> display_sdl: Blanking off, put into msgq
[00:00:01.010,000] <err> display_sdl: Blanking off, wait for sem
[00:00:01.010,000] <err> display_sdl: SDL thread message received
[00:00:01.010,000] <err> display_sdl: SDL thread message executed
[00:00:01.010,000] <err> display_sdl: SDL thread waiting for message
[00:00:01.010,000] <err> display_sdl: Blanking off, sem taken
[00:00:01.010,000] <err> display_sdl: Blanking off, mutex unlock
[00:00:01.010,000] <err> display_sdl: Writing, take mutex
[00:00:01.010,000] <err> display_sdl: Writing, put message
[00:00:01.010,000] <err> display_sdl: Writing, wait for semaphore
[00:00:01.010,000] <err> display_sdl: SDL thread message received
[00:00:01.010,000] <err> display_sdl: SDL thread message executed
[00:00:01.010,000] <err> display_sdl: SDL thread waiting for message
[00:00:01.010,000] <err> display_sdl: Writing, semaphore taken

vs. running with sleep:

[00:00:00.000,000] <err> display_sdl: Blanking off, mutex lock
[00:00:00.000,000] <err> display_sdl: Blanking off, put into msgq
[00:00:00.000,000] <err> display_sdl: Blanking off, wait for sem
[00:00:00.000,000] <err> display_sdl: SDL thread started, zoom 100
[00:00:00.000,000] <err> display_sdl: SDL thread waiting for message
[00:00:00.000,000] <err> display_sdl: SDL thread message received
[00:00:00.000,000] <err> display_sdl: SDL thread message executed
[00:00:00.000,000] <err> display_sdl: SDL thread waiting for message
[00:00:00.000,000] <err> display_sdl: Blanking off, sem taken
[00:00:00.000,000] <err> display_sdl: Blanking off, mutex unlock
[00:00:00.000,000] <err> display_sdl: Writing, take mutex
[00:00:00.000,000] <err> display_sdl: Writing, put message
[00:00:00.000,000] <err> display_sdl: Writing, wait for semaphore
[00:00:00.000,000] <err> display_sdl: SDL thread message received
[00:00:00.000,000] <err> display_sdl: SDL thread message executed
[00:00:00.000,000] <err> display_sdl: SDL thread waiting for message
[00:00:00.000,000] <err> display_sdl: Writing, semaphore taken

So from what I see the main difference is that the thread starts later. The issue seems to be that the SDL thread is not started unless the main thread is blocked for a short period.

@Finomnis
Copy link
Contributor

Does @faxe1008's PR fix your issue?

@Finomnis
Copy link
Contributor

Finomnis commented Apr 22, 2025

Is this a thread priority issue? Does the SDL thread have a lower priority than main?

@VynDragon
Copy link
Collaborator Author

VynDragon commented Apr 22, 2025

No it does not fix the issue, note this is happening with all color schemes, not just L8 (it was just what i was using when i hit this, i also tried rgb 565)
I also tried to change CONFIG_SDL_DISPLAY_THREAD_PRIORITY to 25 and 75 and that didnt fix it either.

Would system configuration be relevant? I'm using sdl2-compat with SDL3 for sdl.

@aescolar
Copy link
Member

I also tried to change CONFIG_SDL_DISPLAY_THREAD_PRIORITY to 25 and 75 and that didnt fix it either.

The other way around:
https://docs.zephyrproject.org/latest/kernel/services/threads/index.html#thread-priorities
Smaller numbers mean higher priority

@VynDragon
Copy link
Collaborator Author

ah mb i thought it was centered on 50 for some reason.
It works with -25 with and without the PR.

@aescolar
Copy link
Member

aescolar commented Apr 22, 2025

@faxe1008 @Finomnis I think we should just increase the default CONFIG_SDL_DISPLAY_THREAD_PRIORITY to 0 or even to -CONFIG_NUM_COOP_PRIORITIES
(Note this discussion #87883 (comment) )

@danieldegrasse danieldegrasse added the priority: low Low impact/importance bug label Apr 22, 2025
@Finomnis
Copy link
Contributor

How about CONFIG_SDL_DISPLAY_THREAD_PRIORITY=14? Does this also fix the issue?

@VynDragon
Copy link
Collaborator Author

VynDragon commented Apr 22, 2025

How about CONFIG_SDL_DISPLAY_THREAD_PRIORITY=14? Does this also fix the issue?

It does not, it starts working with CONFIG_SDL_DISPLAY_THREAD_PRIORITY=-1 without the PR, and 0 with the PR.

@Finomnis
Copy link
Contributor

Finomnis commented Apr 22, 2025

While this feels like a fix for you, I suspect it's actually more of a workaround. Why would the display thread not run at all, even at lower priority?

I mean, if the main thread has a higher priority than the display thread (which is sane imo, display usually is less real time critical than the actual control logic of a system), then of course the display thread does not do anything while main is active. I do not consider this a bug.

On the other hand, if you have the display thread be a very high priority, maybe even a coop priority, then every other task, no matter how low its own priority is, has the ability to schedule high priority work through the display API. This is not really good for the reliability of an RTOS imo, but maybe that last point is incorrect. If someone else has thinks my opinion is wrong here, feel free to tell me why.

@VynDragon
Copy link
Collaborator Author

VynDragon commented Apr 22, 2025

I mean, if the main thread has a higher priority than the display thread (which is sane imo, display usually is less real time critical than the actual control logic of a system), then of course the display thread does not do anything while main is active. I do not consider this a bug.

Negatory, sleeping or yielding in main thread does not cause display to update when main is done, causes display to stay blank forever, eg it does not initialize properly and the display write (and future display writes) are not taken into account at all.

@VynDragon
Copy link
Collaborator Author

Or put differently, device_is_ready is false because it is not okay to write to display and breaks the display when done, but true is returned anyway.

@VynDragon
Copy link
Collaborator Author

VynDragon commented Apr 22, 2025

sleeping or yielding in main thread

Actually it doesnt update there either with -1 priority, only when main thread is exited.
Workarounds (waiting 1s or blanking on then blanking off) do still work with yield or sleep though.

Edit: that applies witrhout the PR, with the PR the sleep and yield work without exiting main thread when setting priority to 0 or -1.
Eg, it needs both the PR and the priority change to work properly.

@Finomnis
Copy link
Contributor

Finomnis commented Apr 23, 2025

it does not initialize properly and the display write (and future display writes) are not taken into account at all.

Yes, exactly. That's why I don't think it's a thread priority issue but a bug somewhere else. Because if thread priority was the issue, the writes would all happen after main is done.

device_is_ready is false

That sounds like the real issue.

Is it possible to temporarily lift thread priority by waiting on it from another higher priority thread?

It sound like there's a synchronization missing; wherever the device gets initialized should wait until the thread has performed the SDL initialization. It should never reach main before that.

That's why I think the priority modification is just a hack; it happens to reorder actions in the correct order by accident, while the real solution would be a synchronization primitive that enforces order regardless of priority.

@faxe1008
Copy link
Collaborator

Yes the issue is that the display drivers thread needs to run at least once before reaching the main.
The fix for this is to yield inside of the driver init and change the priority - aside from that there is no other means of
forcing a thread to be scheduled right?

@Finomnis
Copy link
Contributor

Finomnis commented Apr 23, 2025

Well you can wait with a semaphore or similar inside of the driver init, until the initialization is finished.

@faxe1008
Copy link
Collaborator

Adjusted the PR, with all the changes.

@Finomnis
Copy link
Contributor

@VynDragon Does the new version of #88886 fix your problem?

@VynDragon
Copy link
Collaborator Author

yes it does.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area: Display area: native port Host native arch port (native_sim) bug The issue is a bug, or the PR is fixing a bug priority: low Low impact/importance bug
Projects
None yet
Development

Successfully merging a pull request may close this issue.

6 participants