Skip to content

net_mgmt: event handling is broken #88534

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
JordanYates opened this issue Apr 12, 2025 · 9 comments
Open

net_mgmt: event handling is broken #88534

JordanYates opened this issue Apr 12, 2025 · 9 comments
Assignees
Labels
area: Networking bug The issue is a bug, or the PR is fixing a bug priority: medium Medium impact/importance bug

Comments

@JordanYates
Copy link
Collaborator

JordanYates commented Apr 12, 2025

Describe the bug

The current mechanism for subscribing to net_mgmt events is currently broken, in that code that register for callbacks (net_mgmt_init_event_callback) on certain events will also result in callbacks being run on other, seemingly random events. This also affects the event waiting functions, net_mgmt_event_wait and net_mgmt_event_wait_on_iface.

The core of the problem is that these APIs are defined at the bitmask level, i.e. the internal code is checking bitmasks to determine whether an event matches the requested events:

!(NET_MGMT_GET_COMMAND(mgmt_event->event) &
NET_MGMT_GET_COMMAND(cb->event_mask)))) {

The problem is that ALL of the events are defined as integer values, not bitmasks. The following example is from the Wi-Fi layer, but it is the same for all the others:

/** @brief Wi-Fi management events */
enum net_event_wifi_cmd {
/** Scan results available */
NET_EVENT_WIFI_CMD_SCAN_RESULT = 1,
/** Scan done */
NET_EVENT_WIFI_CMD_SCAN_DONE,

/** Event emitted for Wi-Fi scan result */
#define NET_EVENT_WIFI_SCAN_RESULT \
(_NET_WIFI_EVENT | NET_EVENT_WIFI_CMD_SCAN_RESULT)
/** Event emitted when Wi-Fi scan is done */
#define NET_EVENT_WIFI_SCAN_DONE \
(_NET_WIFI_EVENT | NET_EVENT_WIFI_CMD_SCAN_DONE)

As a result, if you register for callbacks like so, net_mgmt_init_event_callback(&cb, wifi_mgmt_event_handler, NET_EVENT_WIFI_CMD_AP_STA_CONNECTED);, where NET_EVENT_WIFI_CMD_AP_STA_CONNECTED == 15, you end up being notified when any of the following events occur, since they all share bits with 15.

NET_EVENT_WIFI_CMD_SCAN_RESULT
NET_EVENT_WIFI_CMD_SCAN_DONE
NET_EVENT_WIFI_CMD_CONNECT_RESULT
NET_EVENT_WIFI_CMD_DISCONNECT_RESULT
NET_EVENT_WIFI_CMD_IFACE_STATUS
NET_EVENT_WIFI_CMD_TWT
NET_EVENT_WIFI_CMD_TWT_SLEEP_STATE
NET_EVENT_WIFI_CMD_RAW_SCAN_RESULT
NET_EVENT_WIFI_CMD_DISCONNECT_COMPLETE
NET_EVENT_WIFI_CMD_SIGNAL_CHANGE
NET_EVENT_WIFI_CMD_NEIGHBOR_REP_RECEIVED
NET_EVENT_WIFI_CMD_NEIGHBOR_REP_COMPLETE
NET_EVENT_WIFI_CMD_AP_ENABLE_RESULT
NET_EVENT_WIFI_CMD_AP_DISABLE_RESULT
NET_EVENT_WIFI_CMD_AP_STA_CONNECTED

Potential Solutions

  1. Convert events to bitmasks

Converting events from integers to bitmasks (as implied by the API), would resolve the issue.
Unfortunately, the command mask only has space for 16 unique events per layer:

#define NET_MGMT_COMMAND_MASK 0x0000FFFF

The Wi-Fi management layer already defines 17 unique events, and the IPv6 layer defines over 20.

  1. Convert management event mask to 64 bits, convert events to bitmasks

Converting the event mask to 64 bits would give up to 32 additional possible masks to each layer.
This would be a rather intrusive change, requiring updates to every callback definition to change the API signature to accept the 64bit parameter.
Concerns have also been raise that we will eventually run out of events again: #88495 (comment)

  1. Convert net_mgmt_init_event_callback to register for layers, not events

Update the semantics of net_mgmt_init_event_callback so that instead of registering for individual events, users register for the base layer and perform their own event filtering in the callback. This would allow the event IDs to remain integers, but it a significant API change.

It additionally would make the waiting functions (net_mgmt_event_wait, net_mgmt_event_wait_on_iface) not possible to implement for more than a single event (not that they currently work properly).

Additional context

Initial attempt to fix: #88495

@JordanYates JordanYates added bug The issue is a bug, or the PR is fixing a bug area: Networking labels Apr 12, 2025
@nashif nashif added the priority: medium Medium impact/importance bug label Apr 13, 2025
@jukkar
Copy link
Member

jukkar commented Apr 14, 2025

you end up being notified when any of the following events occur, since they all share bits with 15.

Yes, but that is why in the event callback one needs to check what event if being fired. This is documented in https://docs.zephyrproject.org/latest/connectivity/networking/api/net_mgmt.html#listening-to-network-events

@JordanYates
Copy link
Collaborator Author

you end up being notified when any of the following events occur, since they all share bits with 15.

Yes, but that is why in the event callback one needs to check what event if being fired. This is documented in https://docs.zephyrproject.org/latest/connectivity/networking/api/net_mgmt.html#listening-to-network-events

It may be documented there, but it isn't mentioned anywhere on the function doxygen. If you can't properly differentiate between different events as part of the API, why pretend that you can? It should just be subscribing for events at the "layer code" level.

There is also the question of what the semantics of net_mgmt_event_wait are, if the current behaviour is expected. Instead of Used to wait synchronously on an event mask it should be Used to wait synchronously on any event with the same layer code. The current usage in tree certainly seems to be assuming that exactly the event requested is being blocked on.

printk("Waiting for L4 connected\n");
ret = net_mgmt_event_wait_on_iface(iface, NET_EVENT_L4_CONNECTED, NULL, NULL, NULL,
K_SECONDS(120));
if (ret != 0) {
printk("L4 was not connected in time\n");
return -1;
}
printk("Waiting for DNS server added\n");
ret = net_mgmt_event_wait_on_iface(iface, NET_EVENT_DNS_SERVER_ADD, NULL, NULL, NULL,
K_SECONDS(10));
if (ret) {
printk("DNS server was not added in time\n");
return -1;
}

@carlescufi
Copy link
Member

Why not change the registration function to take an arbitrary-length bit array so that we are not stuck at 64 events max? This is essentially what Bluetooth does today.

@jukkar
Copy link
Member

jukkar commented Apr 16, 2025

Robert and I pondered this a bit offline and we need to overhaul this as Jordan noticed.

  • It seems that mapping events to bit is a possible solution. How do we make this extensible (perhaps bitarray? https://github.com/zephyrproject-rtos/zephyr/blob/main/include/zephyr/sys/bitarray.h)
  • What would the user API look in this case, do we need a new API for making this work properly. In which case we should probably deprecate the old API unless it could be converted to use this new API somehow.
  • We should probably have one central header file that is used to allocate the different layer codes and which could act as a registry. Now the layer codes are scattered around which makes it difficult to assure that there are no duplicate layers. Also we could allocate some layers for out-of-tree code

Comments / ideas?

@JordanYates
Copy link
Collaborator Author

Comments / ideas?

My 2 cents. Callbacks registered at the "layer code" level, with individual events filtered out inside the callbacks.
Since there would be no longer be a need for everything to fit in a uint32_t, more bits could be allocated to the layers to optimize the internal processing, which right now is a bit of a mess (global_event_mask is a bit of a travesty atm).
event_wait (or a new function name) would take the layer code and event ID as individual parameters, and loop internally with a k_timepoint_t (the looping could be transitioned to something more optimal if there is a need).

@jukkar
Copy link
Member

jukkar commented Apr 16, 2025

@JordanYates IMHO your proposal sounds reasonable. Looking at the PR #88495, it would need some tweaks, are you able to work on this?

@carlescufi
Copy link
Member

@JordanYates gentle ping, will you update the PR?

@JordanYates
Copy link
Collaborator Author

@JordanYates IMHO your proposal sounds reasonable. Looking at the PR #88495, it would need some tweaks, are you able to work on this?

Sorry, I don't really have the time at the moment to make changes with that sort of scope, there are almost 100 instances of net_mgmt_add_event_callback in tree.

@jukkar
Copy link
Member

jukkar commented Apr 24, 2025

@JordanYates no worries, I can take a look at this.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area: Networking bug The issue is a bug, or the PR is fixing a bug priority: medium Medium impact/importance bug
Projects
None yet
Development

No branches or pull requests

5 participants