Bluetooth: Mesh: Rework publication timer #34310

trond-snekvik · 2021-04-15T10:28:47Z

Periodic publication would previously build and send the first
publication inside the bt_mesh_model_pub() function, before cancelling
and rescheduling the next publication. The timer handler would only
handle retransmissions, and would abandon the rest of the publication
event if one of the packets failed to send.

This design has three issues:

If the initial timer cancel fails, the publication would interfer with
the periodic publication management, which might skip an event or
send too many packets.
If any of the messages fail to publish, the full publication event
would be abandoned. This is not predictable or expected from the API.
bt_mesh_model_pub() required 384 bytes of stack to build the message,
which has to be factored into all calling threads.

This patch moves all transmission into the publication timer by
replacing k_work_cancel with a single k_work_reschedule(K_NO_WAIT). It
also changes the error recovery behavior to attempt to finish the full
publication event even if some of the transmissions fail.

Split out from #33782.

Signed-off-by: Trond Einar Snekvik [email protected]

joerchan · 2021-04-15T10:51:32Z

subsys/bluetooth/mesh/transport.c

@@ -632,12 +632,12 @@ int bt_mesh_trans_send(struct bt_mesh_net_tx *tx, struct net_buf_simple *msg,
 		return -EINVAL;
 	}

-	if (msg->len > BT_MESH_TX_SDU_MAX) {
-		BT_ERR("Not enough segment buffers for length %u", msg->len);
+	if (msg->len > BT_MESH_TX_SDU_MAX - BT_MESH_MIC_SHORT) {


This looks unrelated to the rework of the timer, is this a bug-fix?

Should be it's own commit if it is.

Split into its own commit.

joerchan · 2021-04-15T11:00:43Z

What about the TBDs? Is is still to be decided?

pabigot · 2021-04-15T11:01:44Z

What about the TBDs? Is is still to be decided?

Yes, there's a problem there. I'm looking at it to see how it can be clearly described and to identify potential solutions.

trond-snekvik · 2021-04-15T11:11:51Z

~~Looks like I missed that TBD, sorry about that. This mod_commit function is only called as part of the settings load procedure, and I think it's fair to use schedule with that.~~
Edit: Sorry, just realized I missed the changes in config server and main. Will review.

pabigot · 2021-04-15T11:36:56Z

Edit: Sorry, just realized I missed the changes in config server and main. Will review.

@trond-snekvik here are my notes on that:

There appear to be data races with the publish state in bt_mesh_model_pub, specifically but maybe not exclusively the count bitfield and period_start field.

In mesh/access.c:

bt_mesh_model_publish sets count and other fields and reschedules the work item
[WQ] mod_publish runs on the work queue and invokes pub_period_start, publish_transmit, and pub_period_sent
[WQ] publish_transmit invokes bt_mesh_trans_send with callbacks that invoke publish_start and publish_sent. (Are the callbacks invoked by the calling (WQ) thread?)
[WQ] pub_period_start sets count, invokes an update function(?), then sets more properties and invokes publish_sent
publish_sent reads count and schedules the work
publish_start invokes publish_sent, reads count, and updates period_start

In mesh/main.c:

model_suspend clears count and cancels the work

It's the TBD in mesh/main.c that caught my attention: AFAICT the work handler does check whether count is non-zero; but if it is zero (as will happen if the cancellation after setting count to zero fails) what might happen is the work item starts a new publish cycle.

If the publish_{start,sent} callbacks are only invoked (indirectly) by the work queue thread, then the potential races on an SMP system are:

bt_mesh_model_publish from access
model_suspend from main
the work queue thread

I'm not sure whether there are races in a single-processor with non-preemptive threads. There could be if any of the operations invoked by mod_publish sleep, because in that situation the cancellation could fail.

Periodic publication would previously build and send the first publication inside the bt_mesh_model_pub() function, before cancelling and rescheduling the next publication. The timer handler would only handle retransmissions, and would abandon the rest of the publication event if one of the packets failed to send. This design has three issues: - If the initial timer cancel fails, the publication would interfer with the periodic publication management, which might skip an event or send too many packets. - If any of the messages fail to publish, the full publication event would be abandoned. This is not predictable or expected from the API. - bt_mesh_model_pub() required 384 bytes of stack to build the message, which has to be factored into all calling threads. This patch moves all transmission into the publication timer by replacing k_work_cancel with a single k_work_reschedule(K_NO_WAIT). It also changes the error recovery behavior to attempt to finish the full publication event even if some of the transmissions fail. Split out from zephyrproject-rtos#33782. Signed-off-by: Trond Einar Snekvik <[email protected]>

The Transport layer would previously rely on the access layer to check whether there's room for the full message and a MIC in the available buffer space, and its own checks would ignore the MIC. This should be handled by the Transport layer checks, so the access layer doesn't have to. Signed-off-by: Trond Einar Snekvik <[email protected]>

trond-snekvik · 2021-04-15T13:02:45Z

@pabigot thanks for the thorough notes.

[WQ] publish_transmit invokes bt_mesh_trans_send with callbacks that invoke publish_start and publish_sent. (Are the callbacks invoked by the calling (WQ) thread?)

To answer your question - the callbacks are either called from the cooperative advertising thread (adv_legacy.c), from a work handler, or inline in the bt_mesh_trans_send() call.

I have reviewed the TBDs in main and cfg_srv, and added a check for disabled and suspended publication in the work handler to catch any cancellation failures.

You're right about the count variable, it's going to have a race condition if we call publish() from a preemptive thread, but this is far from the only SMP issue in the mesh stack. In fact, the entire Bluetooth stack is full of SMP and preemptive scheduling issues, and calling any Bluetooth API from a preemptive thread is effectively undefined behavior. Changing this would be more work than we'd be able to do for 2.6, and I'd rather prioritize the stuff we know we can fix.

Now, for the cooperative scheduling case:

If any of the cancel calls fail, we may wait for the work item to fire (and return early) before the publication state comes to rest. In the config server, I believe this is unproblematic, as the on-air interface cannot guarantee atomic control anyway.

The bt_mesh_suspend() and bt_mesh_reset() calls are a bit more questionable. In both cases, we can guarantee that the stack won't be sending anything after the calls finish, but I we can't guarantee that the work items are idle. As far as I can tell, there's no way to get any invalid behavior because of this, but it's not ideal. I don't think this is fully resolvable, as we have to make these APIs callable from the system work queue. The user can call k_work_flush outside of the stack to get around this, if necessary.

pabigot

Given recognition that SMP-related race conditions are still present, I think this does a good job of reducing problems from preemptible threads. Documenting what happens when the cancel operation fails is also a good practice.

When Light LC Mode Set message is sent with the value Off, 2 messages are published: Light LC Light OnOff Status and Light LC Mode Status. After publication has been moved from the caller thread to work queue in zephyrproject-rtos/zephyr#34310 it became impossible to publish 2 messages consecutively. However, according to the changes in the model spec, section 6.5.1.1 (see Errata ID 11372), Light LC Server should not use publication for configuration data: Changes in the Light LC Mode state, or in the Light LC Occupancy Mode state, or in the Light LC Light OnOff state shall not trigger publication of the Light LC Mode Status, or Light LC OM Status, or Light LC Light OnOff Status messages. This change removes publication of Mode and Occupancy Mode states upon reception of corresponding Set messages. Light OnOff state is published according to state machine's state changes. Signed-off-by: Pavel Vasilyev <[email protected]>

After publication API change in zephyrproject-rtos/zephyr#34310 it became impossible to publish several messages in a row, because a new message substitues the current message that is going to be published by access layer. This change swaps sensor after each work timeout. Signed-off-by: Pavel Vasilyev <[email protected]>

trond-snekvik requested a review from pabigot April 15, 2021 10:28

trond-snekvik requested review from jhedberg, joerchan and Vudentz as code owners April 15, 2021 10:28

github-actions bot added area: API Changes to public APIs area: Bluetooth area: Bluetooth Mesh labels Apr 15, 2021

trond-snekvik force-pushed the access_pub branch from 74e06a4 to 1e4839a Compare April 15, 2021 10:44

jhedberg approved these changes Apr 15, 2021

View reviewed changes

joerchan reviewed Apr 15, 2021

View reviewed changes

trond-snekvik added 2 commits April 15, 2021 14:17

trond-snekvik force-pushed the access_pub branch from 1e4839a to 799c804 Compare April 15, 2021 13:02

pabigot approved these changes Apr 15, 2021

View reviewed changes

carlescufi merged commit 5c6df44 into zephyrproject-rtos:master Apr 15, 2021

trond-snekvik mentioned this pull request Apr 16, 2021

Changes to bluetooth mesh for k_work API #33782

Merged

PavelVPV mentioned this pull request May 18, 2021

Bluetooth: Mesh: Pass correct pointer to publish_sent #35296

Merged

PavelVPV mentioned this pull request May 18, 2021

Bluetooth: Mesh: Remove configuration data publication in Light LC nrfconnect/sdk-nrf#4568

Merged

PavelVPV mentioned this pull request Jun 14, 2021

Bluetooth: Mesh: Fix sensor client sample publication nrfconnect/sdk-nrf#4834

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Bluetooth: Mesh: Rework publication timer #34310

Bluetooth: Mesh: Rework publication timer #34310

trond-snekvik commented Apr 15, 2021

joerchan Apr 15, 2021 •

edited

Loading

trond-snekvik Apr 15, 2021

joerchan commented Apr 15, 2021

pabigot commented Apr 15, 2021 •

edited

Loading

trond-snekvik commented Apr 15, 2021 •

edited

Loading

pabigot commented Apr 15, 2021

trond-snekvik commented Apr 15, 2021

pabigot left a comment

Bluetooth: Mesh: Rework publication timer #34310

Bluetooth: Mesh: Rework publication timer #34310

Conversation

trond-snekvik commented Apr 15, 2021

joerchan Apr 15, 2021 • edited Loading

Choose a reason for hiding this comment

trond-snekvik Apr 15, 2021

Choose a reason for hiding this comment

joerchan commented Apr 15, 2021

pabigot commented Apr 15, 2021 • edited Loading

trond-snekvik commented Apr 15, 2021 • edited Loading

pabigot commented Apr 15, 2021

trond-snekvik commented Apr 15, 2021

pabigot left a comment

Choose a reason for hiding this comment

joerchan Apr 15, 2021 •

edited

Loading

pabigot commented Apr 15, 2021 •

edited

Loading

trond-snekvik commented Apr 15, 2021 •

edited

Loading