-
Notifications
You must be signed in to change notification settings - Fork 7.3k
samples: bluetooth: Increase number of buffers for avoiding deadlocks #88655
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
samples: bluetooth: Increase number of buffers for avoiding deadlocks #88655
Conversation
Thanks for the fix! We might have other samples that need similar changes too. We've "automated" this for Mesh (which from the host's perspective is just another app), but for any other apps I suppose this just needs to be manually set through zephyr/subsys/bluetooth/common/Kconfig Line 208 in 1e1f944
|
The central_multilink sample issues various API calls (in particular because it does some from the system workqueue) this results in a deadlock since there are not enough HCI command buffers available,this is causing deadlocks on hardware that do not support BT_HCI_ACL_FLOW_CONTROL. Signed-off-by: Gudipudi Ramana Kumar <[email protected]>
deaee5f
to
1f61394
Compare
# Increase the number of buffers to avoid deadlock when running out of buffers | ||
CONFIG_BT_BUF_CMD_TX_COUNT=2 | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
there are not enough
HCI command buffers available,this is causing deadlocks on hardware
that do not support BT_HCI_ACL_FLOW_CONTROL.
Could you explain how Controller to Host "data" flow control feature "disabled" needs an increase in HCI "command" buffers? Please update the commit message accordingly.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@cvinayak I believe this comes from here, i.e. the code is adding one more command buffer when flow control is enabled:
zephyr/subsys/bluetooth/common/hci_common_internal.h
Lines 35 to 38 in b2b7a46
#if defined(CONFIG_BT_HCI_ACL_FLOW_CONTROL) | |
/* When Controller to Host data flow control is supported in Host plus Controller build, or | |
* Host-only build, then we need additional BT_BUF_ACL_RX_COUNT number of HCI command buffers for | |
* enqueuing Host Number of Completed Packets command. |
So the code might look like it's working, but in fact the way the application uses the APIs (and in particular which context it uses them from) creates a requirement for an additional buffer. I agree that this would be good to explain in the commit message.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am interested in the HCI command flow (order) that is causing the deadlock; is the deadlock between two threads, which threads?, can this still happen with CONFIG_BT_HCI_ACL_FLOW_CONTROL=y
? i.e. is there a requirement to enqueue 3 HCI Cmd Buffers (HCI due to API from thread 1, HCI due to API thread 2, and HCI Host Number of Completed Packets cmd)?
(may be, we do not need to solve it in this PR but do update the commit message or have in-line comment with details)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hey @cvinayak I'm attaching the log with the hci command flow:
*** Booting Zephyr OS build v4.1.0-2607-gf9220146e623 ***
bt_hci_cmd_send_sync opcode 0x1003->Sending command 0x1003 (buf 0x13470) to driver
sync cmd released
rsp 0x13470 opcode 0x1003 len 9
bt_hci_cmd_send_sync opcode 0x1001->Sending command 0x1001 (buf 0x13470) to driver
sync cmd released
rsp 0x13470 opcode 0x1001 len 9
bt_hci_cmd_send_sync opcode 0x1002->Sending command 0x1002 (buf 0x13470) to driver
sync cmd released
rsp 0x13470 opcode 0x1002 len 65
bt_hci_cmd_send_sync opcode 0x2003->Sending command 0x2003 (buf 0x13470) to driver
sync cmd released
rsp 0x13470 opcode 0x2003 len 9
bt_hci_cmd_send_sync opcode 0x2002->Sending command 0x2002 (buf 0x13470) to driver
sync cmd released
rsp 0x13470 opcode 0x2002 len 4
bt_hci_cmd_send_sync opcode 0x201c->Sending command 0x201c (buf 0x13470) to driver
sync cmd released
rsp 0x13470 opcode 0x201c len 9
bt_hci_cmd_send_sync opcode 0x2001->Sending command 0x2001 (buf 0x13470) to driver
sync cmd released
rsp 0x13470 opcode 0x2001 len 1
bt_hci_cmd_send_sync opcode 0x0c01->Sending command 0x0c01 (buf 0x13470) to driver
sync cmd released
rsp 0x13470 opcode 0x0c01 len 1
bt_hci_cmd_send_sync opcode 0x1009->Sending command 0x1009 (buf 0x13470) to driver
sync cmd released
rsp 0x13470 opcode 0x1009 len 7
Bluetooth initialized
Start Scanning
bt_hci_cmd_send_sync opcode 0x2005->Sending command 0x2005 (buf 0x13470) to driver
sync cmd released
rsp 0x13470 opcode 0x2005 len 1
bt_hci_cmd_send_sync opcode 0x200b->Sending command 0x200b (buf 0x13470) to driver
sync cmd released
rsp 0x13470 opcode 0x200b len 1
bt_hci_cmd_send_sync opcode 0x200c->Sending command 0x200c (buf 0x13470) to driver
sync cmd released
rsp 0x13470 opcode 0x200c len 1
Scanning successfully started
Device found: 11:45:07:02:32:F7 (random) (RSSI -89)
Device found: 6F:0D:E8:93:02:1E (random) (RSSI -44)
bt_hci_cmd_send_sync opcode 0x200c->Sending command 0x200c (buf 0x13470) to driver
sync cmd released
rsp 0x13470 opcode 0x200c len 1
bt_hci_cmd_send_sync opcode 0x200d->Sending command 0x200d (buf 0x13470) to driver
sync cmd released
rsp 0x13470 opcode 0x200d len 0
Start Scanning
bt_hci_cmd_send_sync opcode 0x2005->ASSERTION FAIL [err == 0] @ WEST_TOPDIR/zephyr/subsys/bluetooth/host/hci_core.c:447
Controller unresponsive, command opcode 0x2005 timeout with err -11
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hey @cvinayak and @jhedberg, can you help me out on how to analyze the thread deadlocks. I was able to notice that while the tx_work
was submitted in bt_tx_irq_raise
it did not call tx_processor
later. I could not directly understand how the hci_cmd_pool
or cmd_data[BT_BUF_CMD_TX_COUNT]
are linked to the tx_work
queue.
Here are the detailed logs:
[00:00:02.099,000] <dbg> bt_hci_core: bt_send: buf 0x13dd0 len 10 type 1
[00:00:02.119,000] <dbg> bt_hci_core: hci_core_send_cmd: Sending command 0x200c (buf 0x13dd0) to driver
--- 234 messages dropped ---
[00:00:02.119,000] <dbg> bt_hci_core: bt_send: buf 0x13dd0 len 5 type 1
[00:00:02.121,000] <dbg> bt_hci_core: bt_tx_irq_raise: kick TX
[00:00:02.121,000] <dbg> bt_hci_core: tx_processor: TX process start
[00:00:02.122,000] <dbg> bt_hci_core: bt_recv_unsafe: buf 0x13d6c len 6
[00:00:02.122,000] <dbg> bt_hci_core: hci_cmd_complete: opcode 0x200c
[00:00:02.123,000] <dbg> bt_hci_core: hci_cmd_done: opcode 0x200c status 0x00 buf 0x13d6c
[00:00:02.123,000] <dbg> bt_hci_core: hci_cmd_done: sync cmd released
[00:00:02.123,000] <dbg> bt_hci_core: bt_tx_irq_raise: kick TX
[00:00:02.123,000] <dbg> bt_hci_core: bt_hci_cmd_send_sync: rsp 0x13dd0 opcode 0x200c len 1
[00:00:02.124,000] <dbg> bt_hci_core: bt_hci_cmd_create: opcode 0x200d param_len 25
[00:00:02.125,000] <dbg> bt_hci_core: bt_hci_cmd_send_sync: buf 0x13dd0 opcode 0x200d len 28
[00:00:02.125,000] <dbg> bt_hci_core: bt_tx_irq_raise: kick TX
[00:00:02.125,000] <dbg> bt_hci_core: tx_processor: TX process start
[00:00:02.125,000] <dbg> bt_hci_core: hci_core_send_cmd: fetch cmd
[00:00:02.125,000] <dbg> bt_hci_core: hci_core_send_cmd: Sending command 0x200d (buf 0x13dd0) to driver
[00:00:02.125,000] <dbg> bt_hci_core: bt_send: buf 0x13dd0 len 28 type 1
[00:00:02.125,000] <dbg> bt_hci_core: bt_tx_irq_raise: kick TX
[00:00:02.126,000] <dbg> bt_hci_core: tx_processor: TX process start
[00:00:02.126,000] <dbg> bt_hci_core: bt_recv_unsafe: buf 0x13d6c len 6
[00:00:02.126,000] <dbg> bt_hci_core: hci_cmd_status: opcode 0x200d
[00:00:02.127,000] <dbg> bt_hci_core: hci_cmd_done: opcode 0x200d status 0x00 buf 0x13d6c
[00:00:02.127,000] <dbg> bt_hci_core: hci_cmd_done: sync cmd released
[00:00:02.127,000] <dbg> bt_hci_core: bt_tx_irq_raise: kick TX
[00:00:02.127,000] <dbg> bt_hci_core: bt_hci_cmd_send_sync: rsp 0x13dd0 opcode 0x200d len 0
[00:00:02.128,000] <dbg> bt_hci_core: tx_processor: TX process start
[00:00:02.665,000] <dbg> bt_hci_core: bt_recv_unsafe: buf 0x13268 len 21
[00:00:02.666,000] <dbg> bt_hci_core: rx_work_handler: Getting net_buf from queue
[00:00:02.666,000] <dbg> bt_hci_core: rx_work_handler: buf 0x13268 type 2 len 21
[00:00:02.666,000] <dbg> bt_hci_core: hci_event: event 0x3e
[00:00:02.666,000] <dbg> bt_hci_core: hci_le_meta_event: subevent 0x01
[00:00:02.666,000] <dbg> bt_hci_core: le_legacy_conn_complete: status 0x00 role 0 6F:0D:E8:93:02:1E (random)
[00:00:02.667,000] <dbg> bt_hci_core: bt_hci_le_enh_conn_complete: status 0x00 handle 257 role 0 peer 6F:0D:E8:93:02:1E (random) peer RPA 00:00:00:00:00:00
[00:00:02.667,000] <dbg> bt_hci_core: bt_hci_le_enh_conn_complete: local RPA 00:00:00:00:00:00
[00:00:02.667,000] <dbg> bt_smp: bt_smp_accept: conn 0x4738 handle 257
Start Scanning
[00:00:02.669,000] <dbg> bt_id: set_random_address: 03:DD:3A:E9:8C:0A
[00:00:02.669,000] <dbg> bt_hci_core: bt_hci_cmd_create: opcode 0x2005 param_len 6
[00:00:02.669,000] <dbg> bt_hci_core: bt_hci_cmd_create: buf 0x13dd0
[00:00:02.670,000] <dbg> bt_hci_core: bt_hci_cmd_send_sync: buf 0x13dd0 opcode 0x2005 len 9
[00:00:02.670,000] <dbg> bt_hci_core: bt_tx_irq_raise: kick TX
[00:00:02.670,000] <dbg> bt_hci_core: bt_hci_cmd_create: opcode 0x2016 param_len 2
[00:00:07.666,000] <dbg> bt_hci_core: bt_recv_unsafe: buf 0x13288 len 15
ASSERTION FAIL [err == 0] @ WEST_TOPDIR/zephyr/subsys/bluetooth/host/hci_core.c:446
Controller unresponsive, command opcode 0x2005 timeout with err -11
[00:00:12.672,000] <err> os: r0/a1: 0x00000003 r1/a2: 0x00000000 r2/a3: 0x00000002
[00:00:12.672,000] <err> os: r3/a4: 0x00000003 r12/ip: 0x00000010 r14/lr: 0x0820b17b
[00:00:12.672,000] <err> os: xpsr: 0x01000000
[00:00:12.672,000] <err> os: Faulting instruction address (r15/pc): 0x0820b18a
[00:00:12.672,000] <err> os: >>> ZEPHYR FATAL ERROR 3: Kernel oops on CPU 0
[00:00:12.672,000] <err> os: Current thread: 0x4660 (BT RX WQ)
[00:00:12.685,000] <err> os: Halting system
The central_multilink sample issues various API calls (in particular because it does some from the system workqueue) this results in a deadlock since there are not enough HCI command buffers available,this is causing deadlocks on hardware that do not support BT_HCI_ACL_FLOW_CONTROL.