Skip to content

zbus: Add runtime observers pool #88834

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 6 commits into
base: main
Choose a base branch
from

Conversation

rodrigopex
Copy link
Collaborator

The aim of PR #84271 (#79777) was to eliminate the need for the heap during runtime when adding a runtime observer. That was accomplished. However, the change impacted the API and the consistency of the runtime observers, which, after the PR, did not behave like the other types of observers.

After some analysis, community comments, and questions, I could confirm that the feature is proper, but the implementation has broken API and behavior consistency, negatively impacting the zbus quality.

As we are in the same cycle (Zephyr 4.1.99), I believe that it would be better to adjust that to implement the feature but with no effect on the API or the runtime observer's behavior.

@rodrigopex rodrigopex force-pushed the zbus_runtime_observers_pool branch 2 times, most recently from 57113c9 to 1c5ab6d Compare April 20, 2025 12:48
@rodrigopex rodrigopex marked this pull request as ready for review April 20, 2025 12:55
@github-actions github-actions bot added area: Zbus Release Notes To be mentioned in the release notes area: Samples Samples labels Apr 20, 2025
@rodrigopex rodrigopex requested a review from pdgendt April 20, 2025 12:59
@rodrigopex rodrigopex force-pushed the zbus_runtime_observers_pool branch from 1c5ab6d to 4fd5b20 Compare April 20, 2025 13:13
@kartben kartben requested a review from JordanYates April 20, 2025 13:16
@rodrigopex rodrigopex force-pushed the zbus_runtime_observers_pool branch from 4fd5b20 to d91b186 Compare April 20, 2025 14:03
@JordanYates
Copy link
Collaborator

However, the change impacted the API and the consistency of the runtime observers, which, after the PR, did not behave like the other types of observers.

How does the current behavior not act like other types of observers?

the implementation has broken API

zbus is not a stable API, and while that is not an argument for changing API's for no reason, it also is not a reason to not make changes.

What is the advantage of hiding the memory allocation inside the library? The goal of zbus (or at least one of them) is for connecting various components of an application together without necessarily needing to understand the deep internals of each one.

In such a situation, how large should ZBUS_RUNTIME_OBSERVERS_NODE_POOL_SIZE be? Is the goal for users to recompile and re-run their applications multiple times, bumping the value by 1 until allocation no longer fails? What about when one of the runtime users is updated and needs more runtime nodes?

To me, the "overhead" of needing to declare an extra variable somewhere is vastly preferable to needing to mess around with some variably sized Kconfig option after every update.

Honestly, the above is my biggest pain-point with Zephyr as a whole. Every time dynamic allocation from a shared resource creeps in somewhere, you have no clue whether an application will work after an update until you do exhaustive testing on every path in your application.

@rodrigopex
Copy link
Collaborator Author

rodrigopex commented Apr 21, 2025

How does the current behavior not act like other types of observers?

Original signature, the one I would like to keep:

ZBUS_CHAN_ADD_OBS(channel, observer, priority);
zbus_chan_add_obs(const struct zbus_channel *chan, const struct zbus_observer *obs, k_timeout_t timeout);

In the original signature, developers only need to pay attention to the essential thing channel, observers, and priority for static and timeout for runtime. You can call both arbitrarily if using dynamic allocation, which is the regular usage, as message subscribers rely on the heap, and they are usually enabled. The limitation is the heap size.

The static allocation for zbus (message subscribers and runtime observers) is an exceptional usage. That is meant for applications where dynamic allocation is prohibited due to safety reasons, memory constraints, or other reasons. In this scenario, it is normal for developers to calculate array sizes for almost everything, as they need to be statically allocated.

The changed signature affected the API (unnecessarily), added an unrelated variable (node), and relied on an error-prone approach. We are bringing the complexity only necessary for the exceptional usage (heap prohibited) for all users unnecessarily.

int zbus_chan_add_obs(const struct zbus_channel *chan, const struct zbus_observer *obs,
		      struct zbus_observer_node *node, k_timeout_t timeout);

Pain points of the changed signature:

  • Developers need to deal with internal aspects of zbus (linked list), manipulating an unrelated variable.
  • The node variable must be used only once. If used twice, the code does not raise errors or warnings. It simply overwrites the current observation with the latest channel and observer, invalidating the previous call.
  • Developers need to be aware of the node scope. If the scope is not well chosen, the code may break without notice, making the error difficult to find.

zbus is not a stable API, and while that is not an argument for changing API's for no reason, it also is not a reason to not make changes.

Usually, in programming, we avoid changing APIs unnecessarily. Several developers already use that as it is. Why change the API if we can keep it the same and still deliver the feature?

What is the advantage of hiding the memory allocation inside the library? The goal of zbus (or at least one of them) is for connecting various components of an application together without necessarily needing to understand the deep internals of each one.
In such a situation, how large should

Make that easier to use. With the node variable, the developers need to deal with more complexities: the meaning of the variable (since that is not related to the feature), the scope of the variable, and whether the variable is already used.

ZBUS_RUNTIME_OBSERVERS_NODE_POOL_SIZE be? Is the goal for users to recompile and re-run their applications multiple times, bumping the value by 1 until allocation no longer fails? What about when one of the runtime users is updated and needs more runtime nodes?

It is easy. To be safe, you can count in the code how many "zbus_chan_add_obs" you have and put that number in the configuration. You need to analyze the "add" and "rm" sequence calls to optimize them. It may not be worth it since each extra node will take only 8 bytes.

Again, let the user use the heap for flexibility and the stack when they need it, all without changing the way, and without adding unnecessary usage complexities.

@rodrigopex rodrigopex force-pushed the zbus_runtime_observers_pool branch from d91b186 to 341f11f Compare April 21, 2025 11:23
@JordanYates
Copy link
Collaborator

Original signature, the one I would like to keep:

Nothing you describe is a behavioral change, just consequences from the extra parameter in the function.
If that's the point you want to make, fine, but don't claim runtime observers act any differently as a result of the change.
The tests didn't change (aside from the extra parameter), that's just a fact.

It is easy. To be safe, you can count in the code how many "zbus_chan_add_obs"

I disagree that this is "easy". Doing a grep for zbus_chan_add_obs will both under and over count the true requirement in different situations.

  1. Over count: not every usage of zbusu_chan_add_obs will be compiled into an application
  2. Under count: this method completely ignores control flow (loops, function calls, etc)

And sure, you could just set the value to something like 16 and almost never hit an issue, but 128 bytes is not nothing.

Again, let the user use the heap for flexibility and the stack when they need it

How does the user use the stack in this proposal? From what I can tell the only options are the system heap or the new pool you are proposing.

@rodrigopex
Copy link
Collaborator Author

I disagree that this is "easy". Doing a grep for zbus_chan_add_obs will both under and over count the true requirement in different situations.

  1. Over count: not every usage of zbusu_chan_add_obs will be compiled into an application
  2. Under count: this method completely ignores control flow (loops, function calls, etc)

The suggested approach works for almost every application. It is not usual to call that functions (add or rm) in loops. Flow control is more feasible and easier to manage. You can suggest/use another approach if you like.

The truth is that the node approach would not make the loop or flow control easier, since this lifetime would be an issue. You will end up needing an array of nodes...

And sure, you could just set the value to something like 16 and almost never hit an issue, but 128 bytes is not nothing.

If 128 bytes is an issue in your application, you can reduce it to the appropriate value. You just need to analyze or test the code more.

How does the user use the stack in this proposal? From what I can tell the only options are the system heap or the new pool you are proposing.

The pool is an array of nodes statically allocated in the stack and managed by the memory slab.

@rodrigopex
Copy link
Collaborator Author

Talking about the pain points of the node. Take a look at the sample code. This could be easly found in developers code:

/*
 * Copyright (c) 2022 Rodrigo Peixoto <[email protected]>
 * SPDX-License-Identifier: Apache-2.0
 */

#include "messages.h"

#include <stdint.h>

#include <zephyr/kernel.h>
#include <zephyr/logging/log.h>
#include <zephyr/zbus/zbus.h>
LOG_MODULE_DECLARE(zbus, CONFIG_ZBUS_LOG_LEVEL);

ZBUS_CHAN_DEFINE(chan_foo, struct msg_foo, NULL, NULL, ZBUS_OBSERVERS_EMPTY, ZBUS_MSG_INIT(.x = 0));

static void lis_bar_callback(const struct zbus_channel *chan)
{
	const struct msg_foo *msg = zbus_chan_const_msg(chan);

	LOG_INF(" Channel %s foo message value %d", zbus_chan_name(chan), msg->x);
}

ZBUS_LISTENER_DEFINE(lis_bar, lis_bar_callback);

void system_init()
{
        /* This can be a usual error for new users. The node variable goes out of scope... */
	struct zbus_observer_node obs_node;

	zbus_chan_add_obs(&chan_foo, &lis_bar, &obs_node, K_MSEC(200));

	LOG_INF("System started");
}

int main(void)
{
	system_init();

	struct msg_foo msg = {.x = 1};

	while (1) {
		zbus_chan_pub(&chan_foo, &msg, K_FOREVER);

		msg.x += 1;

		k_msleep(1000);
	}
	return 0;
}

The error retuner by that is:

 west build -t run
-- west build: running target run
[0/1] To exit from QEMU enter: 'CTRL+a, x'[QEMU] CPU: qemu32,+nx,+pae
SeaBIOS (version zephyr-v1.0.0-0-g31d4e0e-dirty-20200714_234759-fv-az50-zephyr)
Booting from ROM..
I: System started
E: Page fault at address 0x103cf4 (error code 0x3)
E: Access violation: supervisor thread not allowed to write
E: PTE: 0x103000 -> 0x0000000000103000: US A
E: EAX: 0x001120e0, EBX: 0x00103ce4, ECX: 0xffffffff, EDX: 0x00103cf4
E: ESI: 0x0011f3dc, EDI: 0xffffffff, EBP: 0x0011f394, ESP: 0x0011f394
E: EFLAGS: 0x00000002 CS: 0x0008 CR3: 0x00111000
E: EIP: 0x00105de2
E: call trace:
E:      0: 0x00103f67
E:      1: 0x001024f7
E:      2: 0x001025f3
E:      3: 0x001028ca
E:      4: 0x0010031c
E:      5: 0x00103ce4
E:      6: 0x001006c6
E: >>> ZEPHYR FATAL ERROR 0: CPU exception on CPU 0
E: Current thread: 0x1120e0 (unknown)
E: Halting system

No clues, nothing to help developers see where to start looking for the defect.

@JordanYates
Copy link
Collaborator

The pool is an array of nodes statically allocated in the stack and managed by the memory slab.

Memory slabs and stacks have nothing to do with each other, its certainly not where the memory comes from, but w/e.

No clues, nothing to help developers see where to start looking for the defect.

No clue, apart from the call trace, program counter and the zephyr.map/zephyr.lst files.

If we're talking about consistency, the node method is how every other "notification" mechanism I have seen in Zephyr works.
You provide the memory used for the dynamic notification, you are responsible for making sure it doesn't go out of scope.
There are no internal k_malloc calls or arbitrarily sized memory pools.

@rodrigopex
Copy link
Collaborator Author

rodrigopex commented Apr 21, 2025

The other pain point is reusing the same node. In this sample, that causes a zbus execution issue:

/*
 * Copyright (c) 2022 Rodrigo Peixoto <[email protected]>
 * SPDX-License-Identifier: Apache-2.0
 */

#include "messages.h"

#include <stdint.h>

#include <zephyr/kernel.h>
#include <zephyr/logging/log.h>
#include <zephyr/zbus/zbus.h>
LOG_MODULE_DECLARE(zbus, CONFIG_ZBUS_LOG_LEVEL);

ZBUS_CHAN_DEFINE(chan_foo, struct msg_foo, NULL, NULL, ZBUS_OBSERVERS_EMPTY, ZBUS_MSG_INIT(.x = 0));

static void lis_bar_callback(const struct zbus_channel *chan)
{
	const struct msg_foo *msg = zbus_chan_const_msg(chan);

	LOG_INF("[BAR] Channel %s, message value %d", zbus_chan_name(chan), msg->x);
}

ZBUS_LISTENER_DEFINE(lis_bar, lis_bar_callback);

static void lis_baz_callback(const struct zbus_channel *chan)
{
	const struct msg_foo *msg = zbus_chan_const_msg(chan);

	LOG_INF("[BAZ] Channel %s, message value %d", zbus_chan_name(chan), msg->x);
}

ZBUS_LISTENER_DEFINE(lis_baz, lis_baz_callback);

static void lis_regular_callback(const struct zbus_channel *chan)
{
	const struct msg_foo *msg = zbus_chan_const_msg(chan);

	LOG_INF("[Regular] Channel %s, message value %d", zbus_chan_name(chan), msg->x);
}

ZBUS_LISTENER_DEFINE(lis_regular, lis_regular_callback);

ZBUS_CHAN_ADD_OBS(chan_foo, lis_regular, 3);

int main(void)
{
	LOG_INF("System started");

	struct zbus_observer_node obs_node;
       
        /* Reuse the same obs_node. This can also be a usual issue for new developers */
	zbus_chan_add_obs(&chan_foo, &lis_bar, &obs_node, K_MSEC(200));
	zbus_chan_add_obs(&chan_foo, &lis_baz, &obs_node, K_MSEC(200));

	struct msg_foo msg = {.x = 1};

	while (1) {
		zbus_chan_pub(&chan_foo, &msg, K_FOREVER);

		msg.x += 1;

		k_msleep(1000);
	}
	return 0;
}

The correct output if you have one node per addition is:

I: System started
I: [Regular] Channel chan_foo, message value 1
I: [BAR] Channel chan_foo, message value 1
I: [BAZ] Channel chan_foo, message value 1
I: [Regular] Channel chan_foo, message value 2
I: [BAR] Channel chan_foo, message value 2
I: [BAZ] Channel chan_foo, message value 2
I: [Regular] Channel chan_foo, message value 3
I: [BAR] Channel chan_foo, message value 3
I: [BAZ] Channel chan_foo, message value 3
I: [Regular] Channel chan_foo, message value 4
I: [BAR] Channel chan_foo, message value 4
I: [BAZ] Channel chan_foo, message value 4
... <continue>

However, the output for this code is:

I: System started
I: [BAZ] Channel chan_foo, message value 1
I: [BAZ] Channel chan_foo, message value 1
I: [BAZ] Channel chan_foo, message value 1
I: [BAZ] Channel chan_foo, message value 1
I: [BAZ] Channel chan_foo, message value 1
I: [BAZ] Channel chan_foo, message value 1
I: [BAZ] Channel chan_foo, message value 1
I: [BAZ] Channel chan_foo, message value 1
I: [BAZ] Channel chan_foo, message value 1
... <continue>

So again, no warnings, nothing to give the user a clue of how to solve the issue. The issue affected the zbus VDED in a way that prevented it from delivering any messages. I suspect it is in a loop. It needs further investigation.

@rodrigopex rodrigopex force-pushed the zbus_runtime_observers_pool branch 2 times, most recently from 35fc548 to bc40168 Compare April 21, 2025 13:00
@rodrigopex
Copy link
Collaborator Author

Memory slabs and stacks have nothing to do with each other, its certainly not where the memory comes from, but w/e.

Stack vs heap: If the mutable data is not in the heap, it is in the stack, right? I am not talking about the stack kernel object.

No clue, apart from the call trace, program counter and the zephyr.map/zephyr.lst files.

Rephrasing that to "no friendly clues", mainly for newcomers.

If we're talking about consistency, the node method is how every other "notification" mechanism I have seen in Zephyr works. You provide the memory used for the dynamic notification, you are responsible for making sure it doesn't go out of scope. There are no internal k_malloc calls or arbitrarily sized memory pools.

I made my point. Let's see what the others think about that. Thanks for the discussion.

@rodrigopex rodrigopex self-assigned this Apr 21, 2025
@rodrigopex rodrigopex changed the title Zbus runtime observers pool zbus: Add runtime observers pool Apr 21, 2025
@rodrigopex rodrigopex force-pushed the zbus_runtime_observers_pool branch 3 times, most recently from eb2c38f to 420f71b Compare April 21, 2025 14:15
A previous PR merged (to remove runtime observers' dependency with heap)
added inconsistencies and compatibility breaks to the zbus. This commit
improves that by removing the inconsistencies and still attending to the
features requested by the community.

Signed-off-by: Rodrigo Peixoto <[email protected]>
Reverse the changes the affected API and add test to the new dynamic and
static variations.

Signed-off-by: Rodrigo Peixoto <[email protected]>
Reverse the changes the affected API and add adjust the sample to the new
dynamic and static variations.

Signed-off-by: Rodrigo Peixoto <[email protected]>
No migration necessary for this feature.

Signed-off-by: Rodrigo Peixoto <[email protected]>
Add documentation of the runtime observers pool.

Signed-off-by: Rodrigo Peixoto <[email protected]>
Add runtime observers static allocation feature to release notes.

Signed-off-by: Rodrigo Peixoto <[email protected]>
@rodrigopex rodrigopex force-pushed the zbus_runtime_observers_pool branch from 420f71b to ec6dcec Compare April 22, 2025 07:49
@pdgendt
Copy link
Collaborator

pdgendt commented Apr 22, 2025

There are pros/cons to both approaches, maybe it could be an option to have both?

  • Revert the API changes
  • Add new API function to allow passing a node/observation/...
  • Extend the ZBUS_RUNTIME_OBSERVERS_NODE_ALLOC choice with ZBUS_RUNTIME_OBSERVERS_NODE_ALLOC_NONE to disable the allocation

These are my two cents, I just didn't like the k_malloc.

@rodrigopex
Copy link
Collaborator Author

There are pros/cons to both approaches, maybe it could be an option to have both?

  • Revert the API changes
  • Add new API function to allow passing a node/observation/...
  • Extend the ZBUS_RUNTIME_OBSERVERS_NODE_ALLOC choice with ZBUS_RUNTIME_OBSERVERS_NODE_ALLOC_NONE to disable the allocation

These are my two cents, I just didn't like the k_malloc.

Neither do I at first. Now I'm used to it, as it's being used in a very controlled way.

I hear you. I was thinking of an approach that lets the developer allocate that and still provides an abstraction layer to make it simple and less error-prone. I will return with something soon.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area: Samples Samples area: Zbus Release Notes To be mentioned in the release notes
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants