Bluetooth: Mesh: Rework publication timer #34310
74e06a4
to
1e4839a
Compare
@@ -632,12 +632,12 @@ int bt_mesh_trans_send(struct bt_mesh_net_tx *tx, struct net_buf_simple *msg,
 		return -EINVAL;
 	}

-	if (msg->len > BT_MESH_TX_SDU_MAX) {
-		BT_ERR("Not enough segment buffers for length %u", msg->len);
+	if (msg->len > BT_MESH_TX_SDU_MAX - BT_MESH_MIC_SHORT) {
This looks unrelated to the rework of the timer. Is this a bug fix?
It should be its own commit if it is.
Split into its own commit.
What about the TBDs? Is it still to be decided?
Yes, there's a problem there. I'm looking at it to see how it can be clearly described and to identify potential solutions.
@trond-snekvik here are my notes on that: There appear to be data races with the publish state in In
In
It's the TBD in If the
I'm not sure whether there are races on a single processor with non-preemptive threads. There could be if any of the operations invoked by
Periodic publication would previously build and send the first publication inside the bt_mesh_model_pub() function, before cancelling and rescheduling the next publication. The timer handler would only handle retransmissions, and would abandon the rest of the publication event if one of the packets failed to send.

This design has three issues:
- If the initial timer cancel fails, the publication would interfere with the periodic publication management, which might skip an event or send too many packets.
- If any of the messages fail to publish, the full publication event would be abandoned. This is not predictable or expected from the API.
- bt_mesh_model_pub() required 384 bytes of stack to build the message, which has to be factored into all calling threads.

This patch moves all transmission into the publication timer by replacing k_work_cancel with a single k_work_reschedule(K_NO_WAIT). It also changes the error recovery behavior to attempt to finish the full publication event even if some of the transmissions fail.

Split out from zephyrproject-rtos#33782.

Signed-off-by: Trond Einar Snekvik <[email protected]>
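The error recovery behavior described in this commit message can be modelled outside Zephyr. The following is a minimal sketch, not the code from the patch: `struct example_pub`, `example_pub_start()`, `example_pub_timeout()` and the send stub are all hypothetical names, and the Zephyr work queue is replaced by a plain function call per timer expiry. It only illustrates that starting a publication sends nothing itself, and that a failed transmission is counted without abandoning the rest of the event.

```c
#include <stdbool.h>

/* Hypothetical publication state; the field names are illustrative. */
struct example_pub {
	int remaining; /* transmissions left in the current event */
	int attempt;   /* index of the next transmission */
	int sent;      /* transmissions that succeeded */
	int failed;    /* transmissions that failed */
};

/* Send callback: returns 0 on success, a negative value on failure. */
typedef int (*example_send_fn)(int attempt);

/*
 * Start a publication event. Nothing is transmitted here; the commit
 * message describes a single k_work_reschedule(K_NO_WAIT) at this point,
 * which this model leaves to the caller.
 */
void example_pub_start(struct example_pub *pub, int retransmits)
{
	pub->remaining = retransmits + 1;
	pub->attempt = 0;
	pub->sent = 0;
	pub->failed = 0;
}

/*
 * One timer expiry. Sends a single transmission and reports whether the
 * timer should be rescheduled. A failed send is counted but does not end
 * the event.
 */
bool example_pub_timeout(struct example_pub *pub, example_send_fn send)
{
	if (pub->remaining == 0) {
		return false;
	}

	if (send(pub->attempt) == 0) {
		pub->sent++;
	} else {
		pub->failed++;
	}

	pub->attempt++;
	pub->remaining--;

	return pub->remaining > 0;
}

/* Stub for illustration: the second transmission (attempt 1) fails. */
int example_send_fail_second(int attempt)
{
	return (attempt == 1) ? -1 : 0;
}
```

With two retransmissions configured and the stub above, the handler still runs three times: two sends succeed and one fails, instead of the event stopping at the first failure.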
The Transport layer would previously rely on the access layer to check whether there's room for the full message and a MIC in the available buffer space, and its own checks would ignore the MIC. This should be handled by the Transport layer checks, so the access layer doesn't have to.

Signed-off-by: Trond Einar Snekvik <[email protected]>
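As a rough illustration of the length check this commit describes, the sketch below reserves room for a short MIC before accepting a payload. The constant values, the `EXAMPLE_` names, the function name and the returned error code are assumptions made for the example; they are not taken from the patch, and the real limit depends on the configured number of segment buffers.

```c
#include <errno.h>
#include <stddef.h>

/* Illustrative values only, not taken from the patch. */
#define EXAMPLE_TX_SDU_MAX 36u /* assumed total SDU buffer space, in bytes */
#define EXAMPLE_MIC_SHORT   4u /* assumed short MIC length, in bytes */

/*
 * Returns 0 if a payload of `len` bytes still leaves room for a short MIC,
 * or -EMSGSIZE (an error code chosen for this sketch) otherwise.
 */
int example_check_sdu_len(size_t len)
{
	if (len > EXAMPLE_TX_SDU_MAX - EXAMPLE_MIC_SHORT) {
		return -EMSGSIZE;
	}

	return 0;
}
```

Under these assumed values a 32-byte payload is accepted and a 33-byte payload is rejected, whereas a check against `EXAMPLE_TX_SDU_MAX` alone would have let payloads of up to 36 bytes through with no space left for the MIC.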
1e4839a
to
799c804
Compare
@pabigot thanks for the thorough notes.
To answer your question - the callbacks are either called from the cooperative advertising thread (adv_legacy.c), from a work handler, or inline in the I have reviewed the TBDs in main and cfg_srv, and added a check for disabled and suspended publication in the work handler to catch any cancellation failures.

You're right about the count variable, it's going to have a race condition if we call publish() from a preemptive thread, but this is far from the only SMP issue in the mesh stack. In fact, the entire Bluetooth stack is full of SMP and preemptive scheduling issues, and calling any Bluetooth API from a preemptive thread is effectively undefined behavior. Changing this would be more work than we'd be able to do for 2.6, and I'd rather prioritize the stuff we know we can fix.

Now, for the cooperative scheduling case: If any of the cancel calls fail, we may wait for the work item to fire (and return early) before the publication state comes to rest. In the config server, I believe this is unproblematic, as the on-air interface cannot guarantee atomic control anyway. The
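The early-return check mentioned in this comment could look roughly like the sketch below. It is a standalone model under stated assumptions, not the handler from the patch: `struct example_pub_state` and `example_pub_handler()` are hypothetical, publication is treated as disabled when the publish address is the unassigned address (0x0000), and suspension is modelled as a plain flag.

```c
#include <stdbool.h>
#include <stdint.h>

#define EXAMPLE_ADDR_UNASSIGNED 0x0000u

/* Hypothetical publication state for this sketch. */
struct example_pub_state {
	uint16_t addr;  /* publish address; unassigned means disabled */
	bool suspended; /* assumed flag for a suspended stack */
	int sent;       /* messages sent by the handler */
};

/*
 * Work handler model. If a cancel failed and the handler fires anyway, it
 * returns early instead of sending on a disabled or suspended publication.
 * Returns true if a message was sent.
 */
bool example_pub_handler(struct example_pub_state *st)
{
	if (st->addr == EXAMPLE_ADDR_UNASSIGNED || st->suspended) {
		return false;
	}

	st->sent++;

	return true;
}
```

The point of checking inside the handler is that the caller does not need the cancel to succeed: a work item that was already queued finds the state changed and does nothing.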
Given recognition that SMP-related race conditions are still present, I think this does a good job of reducing problems from preemptible threads. Documenting what happens when the cancel operation fails is also a good practice.
When Light LC Mode Set message is sent with the value Off, 2 messages are published: Light LC Light OnOff Status and Light LC Mode Status. After publication has been moved from the caller thread to work queue in zephyrproject-rtos/zephyr#34310 it became impossible to publish 2 messages consecutively. However, according to the changes in the model spec, section 6.5.1.1 (see Errata ID 11372), Light LC Server should not use publication for configuration data: Changes in the Light LC Mode state, or in the Light LC Occupancy Mode state, or in the Light LC Light OnOff state shall not trigger publication of the Light LC Mode Status, or Light LC OM Status, or Light LC Light OnOff Status messages. This change removes publication of Mode and Occupancy Mode states upon reception of corresponding Set messages. Light OnOff state is published according to state machine's state changes. Signed-off-by: Pavel Vasilyev <[email protected]>
After publication API change in zephyrproject-rtos/zephyr#34310 it became impossible to publish several messages in a row, because a new message substitutes the current message that is going to be published by access layer. This change swaps sensor after each work timeout.

Signed-off-by: Pavel Vasilyev <[email protected]>
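The substitution effect described here can be shown with a single-slot model. This is a simplified sketch under assumptions, not code from either repository: `struct example_slot`, `example_publish()`, `example_timeout()` and `example_next_sensor()` are hypothetical, and the pending publication is reduced to one integer. Publishing twice before the work handler runs leaves only the second message, which is why the change rotates to the next sensor on each timeout rather than publishing several in a row.

```c
#include <stddef.h>

/* Hypothetical single pending publication slot. */
struct example_slot {
	int pending; /* non-zero while a message waits for the handler */
	int msg;     /* the message that will be sent */
};

/* Queue a message; a message that is already pending is replaced. */
void example_publish(struct example_slot *slot, int msg)
{
	slot->msg = msg;
	slot->pending = 1;
}

/* Work timeout: sends the pending message, if any. Returns 1 if sent. */
int example_timeout(struct example_slot *slot, int *out)
{
	if (!slot->pending) {
		return 0;
	}

	*out = slot->msg;
	slot->pending = 0;

	return 1;
}

/* Round-robin choice of the sensor to publish on the next timeout. */
size_t example_next_sensor(size_t current, size_t count)
{
	return (current + 1) % count;
}
```

In this model, calling `example_publish()` with message 1 and then message 2 before the timeout results in only message 2 being sent.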