Skip to content

Bluetooth: Mesh: Rework publication timer #34310

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 2 commits into from
Apr 15, 2021

Conversation

trond-snekvik
Copy link
Collaborator

Periodic publication would previously build and send the first
publication inside the bt_mesh_model_pub() function, before cancelling
and rescheduling the next publication. The timer handler would only
handle retransmissions, and would abandon the rest of the publication
event if one of the packets failed to send.

This design has three issues:

  • If the initial timer cancel fails, the publication would interfer with
    the periodic publication management, which might skip an event or
    send too many packets.
  • If any of the messages fail to publish, the full publication event
    would be abandoned. This is not predictable or expected from the API.
  • bt_mesh_model_pub() required 384 bytes of stack to build the message,
    which has to be factored into all calling threads.

This patch moves all transmission into the publication timer by
replacing k_work_cancel with a single k_work_reschedule(K_NO_WAIT). It
also changes the error recovery behavior to attempt to finish the full
publication event even if some of the transmissions fail.

Split out from #33782.

Signed-off-by: Trond Einar Snekvik [email protected]

@@ -632,12 +632,12 @@ int bt_mesh_trans_send(struct bt_mesh_net_tx *tx, struct net_buf_simple *msg,
return -EINVAL;
}

if (msg->len > BT_MESH_TX_SDU_MAX) {
BT_ERR("Not enough segment buffers for length %u", msg->len);
if (msg->len > BT_MESH_TX_SDU_MAX - BT_MESH_MIC_SHORT) {
Copy link
Collaborator

@joerchan joerchan Apr 15, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks unrelated to the rework of the timer, is this a bug-fix?

Should be it's own commit if it is.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Split into its own commit.

@joerchan
Copy link
Collaborator

What about the TBDs? Is is still to be decided?

@pabigot
Copy link
Collaborator

pabigot commented Apr 15, 2021

What about the TBDs? Is is still to be decided?

Yes, there's a problem there. I'm looking at it to see how it can be clearly described and to identify potential solutions.

@trond-snekvik
Copy link
Collaborator Author

trond-snekvik commented Apr 15, 2021

Looks like I missed that TBD, sorry about that. This mod_commit function is only called as part of the settings load procedure, and I think it's fair to use schedule with that.
Edit: Sorry, just realized I missed the changes in config server and main. Will review.

@pabigot
Copy link
Collaborator

pabigot commented Apr 15, 2021

Edit: Sorry, just realized I missed the changes in config server and main. Will review.

@trond-snekvik here are my notes on that:

There appear to be data races with the publish state in bt_mesh_model_pub, specifically but maybe not exclusively the count bitfield and period_start field.

In mesh/access.c:

  • bt_mesh_model_publish sets count and other fields and reschedules the work item
  • [WQ] mod_publish runs on the work queue and invokes pub_period_start, publish_transmit, and pub_period_sent
  • [WQ] publish_transmit invokes bt_mesh_trans_send with callbacks that invoke publish_start and publish_sent. (Are the callbacks invoked by the calling (WQ) thread?)
  • [WQ] pub_period_start sets count, invokes an update function(?), then sets more properties and invokes publish_sent
  • publish_sent reads count and schedules the work
  • publish_start invokes publish_sent, reads count, and updates period_start

In mesh/main.c:

  • model_suspend clears count and cancels the work

It's the TBD in mesh/main.c that caught my attention: AFAICT the work handler does check whether count is non-zero; but if it is zero (as will happen if the cancellation after setting count to zero fails) what might happen is the work item starts a new publish cycle.

If the publish_{start,sent} callbacks are only invoked (indirectly) by the work queue thread, then the potential races on an SMP system are:

  • bt_mesh_model_publish from access
  • model_suspend from main
  • the work queue thread

I'm not sure whether there are races in a single-processor with non-preemptive threads. There could be if any of the operations invoked by mod_publish sleep, because in that situation the cancellation could fail.

Periodic publication would previously build and send the first
publication inside the bt_mesh_model_pub() function, before cancelling
and rescheduling the next publication. The timer handler would only
handle retransmissions, and would abandon the rest of the publication
event if one of the packets failed to send.

This design has three issues:
- If the initial timer cancel fails, the publication would interfer with
  the periodic publication management, which might skip an event or
  send too many packets.
- If any of the messages fail to publish, the full publication event
  would be abandoned. This is not predictable or expected from the API.
- bt_mesh_model_pub() required 384 bytes of stack to build the message,
  which has to be factored into all calling threads.

This patch moves all transmission into the publication timer by
replacing k_work_cancel with a single k_work_reschedule(K_NO_WAIT). It
also changes the error recovery behavior to attempt to finish the full
publication event even if some of the transmissions fail.

Split out from zephyrproject-rtos#33782.

Signed-off-by: Trond Einar Snekvik <[email protected]>
The Transport layer would previously rely on the access layer to check
whether there's room for the full message and a MIC in the available
buffer space, and its own checks would ignore the MIC. This should be
handled by the Transport layer checks, so the access layer doesn't have
to.

Signed-off-by: Trond Einar Snekvik <[email protected]>
@trond-snekvik
Copy link
Collaborator Author

@pabigot thanks for the thorough notes.

[WQ] publish_transmit invokes bt_mesh_trans_send with callbacks that invoke publish_start and publish_sent. (Are the callbacks invoked by the calling (WQ) thread?)

To answer your question - the callbacks are either called from the cooperative advertising thread (adv_legacy.c), from a work handler, or inline in the bt_mesh_trans_send() call.

I have reviewed the TBDs in main and cfg_srv, and added a check for disabled and suspended publication in the work handler to catch any cancellation failures.

You're right about the count variable, it's going to have a race condition if we call publish() from a preemptive thread, but this is far from the only SMP issue in the mesh stack. In fact, the entire Bluetooth stack is full of SMP and preemptive scheduling issues, and calling any Bluetooth API from a preemptive thread is effectively undefined behavior. Changing this would be more work than we'd be able to do for 2.6, and I'd rather prioritize the stuff we know we can fix.

Now, for the cooperative scheduling case:

If any of the cancel calls fail, we may wait for the work item to fire (and return early) before the publication state comes to rest. In the config server, I believe this is unproblematic, as the on-air interface cannot guarantee atomic control anyway.

The bt_mesh_suspend() and bt_mesh_reset() calls are a bit more questionable. In both cases, we can guarantee that the stack won't be sending anything after the calls finish, but I we can't guarantee that the work items are idle. As far as I can tell, there's no way to get any invalid behavior because of this, but it's not ideal. I don't think this is fully resolvable, as we have to make these APIs callable from the system work queue. The user can call k_work_flush outside of the stack to get around this, if necessary.

Copy link
Collaborator

@pabigot pabigot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Given recognition that SMP-related race conditions are still present, I think this does a good job of reducing problems from preemptible threads. Documenting what happens when the cancel operation fails is also a good practice.

@carlescufi carlescufi merged commit 5c6df44 into zephyrproject-rtos:master Apr 15, 2021
PavelVPV added a commit to PavelVPV/sdk-nrf that referenced this pull request May 18, 2021
When Light LC Mode Set message is sent with the value Off,
2 messages are published: Light LC Light OnOff Status and
Light LC Mode Status.

After publication has been moved from the caller thread to
work queue in zephyrproject-rtos/zephyr#34310
it became impossible to publish 2 messages consecutively.

However, according to the changes in the model spec,
section 6.5.1.1 (see Errata ID 11372), Light LC Server should not
use publication for configuration data:
Changes in the Light LC Mode state, or in the Light LC Occupancy
Mode state, or in the Light LC Light OnOff state shall not
trigger publication of the Light LC Mode Status, or Light LC
OM Status, or Light LC Light OnOff Status messages.

This change removes publication of Mode and Occupancy Mode states
upon reception of corresponding Set messages. Light OnOff state
is published according to state machine's state changes.

Signed-off-by: Pavel Vasilyev <[email protected]>
trond-snekvik pushed a commit to nrfconnect/sdk-nrf that referenced this pull request May 18, 2021
When Light LC Mode Set message is sent with the value Off,
2 messages are published: Light LC Light OnOff Status and
Light LC Mode Status.

After publication has been moved from the caller thread to
work queue in zephyrproject-rtos/zephyr#34310
it became impossible to publish 2 messages consecutively.

However, according to the changes in the model spec,
section 6.5.1.1 (see Errata ID 11372), Light LC Server should not
use publication for configuration data:
Changes in the Light LC Mode state, or in the Light LC Occupancy
Mode state, or in the Light LC Light OnOff state shall not
trigger publication of the Light LC Mode Status, or Light LC
OM Status, or Light LC Light OnOff Status messages.

This change removes publication of Mode and Occupancy Mode states
upon reception of corresponding Set messages. Light OnOff state
is published according to state machine's state changes.

Signed-off-by: Pavel Vasilyev <[email protected]>
PavelVPV added a commit to PavelVPV/sdk-nrf that referenced this pull request Jun 14, 2021
After publication API change in zephyrproject-rtos/zephyr#34310
it became impossible to publish several messages in a row,
because a new message substitues the current message that
is going to be published by access layer.

This change swaps sensor after each work timeout.

Signed-off-by: Pavel Vasilyev <[email protected]>
carlescufi pushed a commit to nrfconnect/sdk-nrf that referenced this pull request Jun 17, 2021
After publication API change in zephyrproject-rtos/zephyr#34310
it became impossible to publish several messages in a row,
because a new message substitues the current message that
is going to be published by access layer.

This change swaps sensor after each work timeout.

Signed-off-by: Pavel Vasilyev <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants