Skip to content

samples: bluetooth: Increase number of buffers for avoiding deadlocks #88655

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

gudipudiramanakumar
Copy link
Contributor

The central_multilink sample issues various API calls (in particular because it does some from the system workqueue) this results in a deadlock since there are not enough HCI command buffers available,this is causing deadlocks on hardware that do not support BT_HCI_ACL_FLOW_CONTROL.

@jhedberg
Copy link
Member

Thanks for the fix! We might have other samples that need similar changes too. We've "automated" this for Mesh (which from the host's perspective is just another app), but for any other apps I suppose this just needs to be manually set through prj.conf:

default 2 if BT_MESH

The central_multilink sample issues various API calls
(in particular because it does some from the system workqueue)
this results in a deadlock since there are not enough
HCI command buffers available,this is causing deadlocks on hardware
that do not support BT_HCI_ACL_FLOW_CONTROL.

Signed-off-by: Gudipudi Ramana Kumar <[email protected]>
Comment on lines +8 to +10
# Increase the number of buffers to avoid deadlock when running out of buffers
CONFIG_BT_BUF_CMD_TX_COUNT=2

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

there are not enough
HCI command buffers available,this is causing deadlocks on hardware
that do not support BT_HCI_ACL_FLOW_CONTROL.

Could you explain how Controller to Host "data" flow control feature "disabled" needs an increase in HCI "command" buffers? Please update the commit message accordingly.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@cvinayak I believe this comes from here, i.e. the code is adding one more command buffer when flow control is enabled:

#if defined(CONFIG_BT_HCI_ACL_FLOW_CONTROL)
/* When Controller to Host data flow control is supported in Host plus Controller build, or
* Host-only build, then we need additional BT_BUF_ACL_RX_COUNT number of HCI command buffers for
* enqueuing Host Number of Completed Packets command.

So the code might look like it's working, but in fact the way the application uses the APIs (and in particular which context it uses them from) creates a requirement for an additional buffer. I agree that this would be good to explain in the commit message.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am interested in the HCI command flow (order) that is causing the deadlock; is the deadlock between two threads, which threads?, can this still happen with CONFIG_BT_HCI_ACL_FLOW_CONTROL=y? i.e. is there a requirement to enqueue 3 HCI Cmd Buffers (HCI due to API from thread 1, HCI due to API thread 2, and HCI Host Number of Completed Packets cmd)?

(may be, we do not need to solve it in this PR but do update the commit message or have in-line comment with details)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey @cvinayak I'm attaching the log with the hci command flow:


*** Booting Zephyr OS build v4.1.0-2607-gf9220146e623 ***
bt_hci_cmd_send_sync opcode 0x1003->Sending command 0x1003 (buf 0x13470) to driver
sync cmd released
rsp 0x13470 opcode 0x1003 len 9
bt_hci_cmd_send_sync opcode 0x1001->Sending command 0x1001 (buf 0x13470) to driver                                                                                                 
sync cmd released                                                                                                                                                                  
rsp 0x13470 opcode 0x1001 len 9                                                                                                                                                    
bt_hci_cmd_send_sync opcode 0x1002->Sending command 0x1002 (buf 0x13470) to driver                                                                                                 
sync cmd released                                                                                                                                                                  
rsp 0x13470 opcode 0x1002 len 65                                                                                                                                                   
bt_hci_cmd_send_sync opcode 0x2003->Sending command 0x2003 (buf 0x13470) to driver                                                                                                 
sync cmd released                                                                                                                                                                  
rsp 0x13470 opcode 0x2003 len 9                                                                                                                                                    
bt_hci_cmd_send_sync opcode 0x2002->Sending command 0x2002 (buf 0x13470) to driver                                                                                                 
sync cmd released                                                                                                                                                                  
rsp 0x13470 opcode 0x2002 len 4                                                                                                                                                    
bt_hci_cmd_send_sync opcode 0x201c->Sending command 0x201c (buf 0x13470) to driver                                                                                                 
sync cmd released                                                                                                                                                                  
rsp 0x13470 opcode 0x201c len 9                                                                                                                                                    
bt_hci_cmd_send_sync opcode 0x2001->Sending command 0x2001 (buf 0x13470) to driver                                                                                                 
sync cmd released                                                                                                                                                                  
rsp 0x13470 opcode 0x2001 len 1                                                                                                                                                    
bt_hci_cmd_send_sync opcode 0x0c01->Sending command 0x0c01 (buf 0x13470) to driver                                                                                                 
sync cmd released                                                                                                                                                                  
rsp 0x13470 opcode 0x0c01 len 1                                                                                                                                                    
bt_hci_cmd_send_sync opcode 0x1009->Sending command 0x1009 (buf 0x13470) to driver                                                                                                 
sync cmd released                                                                                                                                                                  
rsp 0x13470 opcode 0x1009 len 7                                                                                                                                                    
Bluetooth initialized                                                                                                                                                              
Start Scanning                                                                                                                                                                     
bt_hci_cmd_send_sync opcode 0x2005->Sending command 0x2005 (buf 0x13470) to driver                                                                                                 
sync cmd released                                                                                                                                                                  
rsp 0x13470 opcode 0x2005 len 1                                                                                                                                                    
bt_hci_cmd_send_sync opcode 0x200b->Sending command 0x200b (buf 0x13470) to driver                                                                                                 
sync cmd released                                                                                                                                                                  
rsp 0x13470 opcode 0x200b len 1                                                                                                                                                    
bt_hci_cmd_send_sync opcode 0x200c->Sending command 0x200c (buf 0x13470) to driver                                                                                                 
sync cmd released                                                                                                                                                                  
rsp 0x13470 opcode 0x200c len 1                                                                                                                                                    
Scanning successfully started                                                                                                                                                      
Device found: 11:45:07:02:32:F7 (random) (RSSI -89)                                                                                                                                
Device found: 6F:0D:E8:93:02:1E (random) (RSSI -44)                                                                                                                                
bt_hci_cmd_send_sync opcode 0x200c->Sending command 0x200c (buf 0x13470) to driver                                                                                                 
sync cmd released                                                                                                                                                                  
rsp 0x13470 opcode 0x200c len 1                                                                                                                                                    
bt_hci_cmd_send_sync opcode 0x200d->Sending command 0x200d (buf 0x13470) to driver                                                                                                 
sync cmd released                                                                                                                                                                  
rsp 0x13470 opcode 0x200d len 0                                                                                                                                                    
Start Scanning                                                                                                                                                                     
bt_hci_cmd_send_sync opcode 0x2005->ASSERTION FAIL [err == 0] @ WEST_TOPDIR/zephyr/subsys/bluetooth/host/hci_core.c:447                                                            
        Controller unresponsive, command opcode 0x2005 timeout with err -11                                                                                                        

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey @cvinayak and @jhedberg, can you help me out on how to analyze the thread deadlocks. I was able to notice that while the tx_work was submitted in bt_tx_irq_raise it did not call tx_processor later. I could not directly understand how the hci_cmd_pool or cmd_data[BT_BUF_CMD_TX_COUNT] are linked to the tx_work queue.

Here are the detailed logs:

[00:00:02.099,000] <dbg> bt_hci_core: bt_send: buf 0x13dd0 len 10 type 1
[00:00:02.119,000] <dbg> bt_hci_core: hci_core_send_cmd: Sending command 0x200c (buf 0x13dd0) to driver
--- 234 messages dropped ---
[00:00:02.119,000] <dbg> bt_hci_core: bt_send: buf 0x13dd0 len 5 type 1
[00:00:02.121,000] <dbg> bt_hci_core: bt_tx_irq_raise: kick TX                                                                                                                     
[00:00:02.121,000] <dbg> bt_hci_core: tx_processor: TX process start                                                                                                               
[00:00:02.122,000] <dbg> bt_hci_core: bt_recv_unsafe: buf 0x13d6c len 6                                                                                                            
[00:00:02.122,000] <dbg> bt_hci_core: hci_cmd_complete: opcode 0x200c                                                                                                              
[00:00:02.123,000] <dbg> bt_hci_core: hci_cmd_done: opcode 0x200c status 0x00  buf 0x13d6c                                                                                         
[00:00:02.123,000] <dbg> bt_hci_core: hci_cmd_done: sync cmd released                                                                                                              
[00:00:02.123,000] <dbg> bt_hci_core: bt_tx_irq_raise: kick TX                                                                                                                     
[00:00:02.123,000] <dbg> bt_hci_core: bt_hci_cmd_send_sync: rsp 0x13dd0 opcode 0x200c len 1                                                                                        
[00:00:02.124,000] <dbg> bt_hci_core: bt_hci_cmd_create: opcode 0x200d param_len 25                                                                                                
[00:00:02.125,000] <dbg> bt_hci_core: bt_hci_cmd_send_sync: buf 0x13dd0 opcode 0x200d len 28                                                                                       
[00:00:02.125,000] <dbg> bt_hci_core: bt_tx_irq_raise: kick TX                                                                                                                     
[00:00:02.125,000] <dbg> bt_hci_core: tx_processor: TX process start                                                                                                               
[00:00:02.125,000] <dbg> bt_hci_core: hci_core_send_cmd: fetch cmd                                                                                                                 
[00:00:02.125,000] <dbg> bt_hci_core: hci_core_send_cmd: Sending command 0x200d (buf 0x13dd0) to driver                                                                            
[00:00:02.125,000] <dbg> bt_hci_core: bt_send: buf 0x13dd0 len 28 type 1                                                                                                           
[00:00:02.125,000] <dbg> bt_hci_core: bt_tx_irq_raise: kick TX                                                                                                                     
[00:00:02.126,000] <dbg> bt_hci_core: tx_processor: TX process start                                                                                                               
[00:00:02.126,000] <dbg> bt_hci_core: bt_recv_unsafe: buf 0x13d6c len 6                                                                                                            
[00:00:02.126,000] <dbg> bt_hci_core: hci_cmd_status: opcode 0x200d                                                                                                                
[00:00:02.127,000] <dbg> bt_hci_core: hci_cmd_done: opcode 0x200d status 0x00  buf 0x13d6c                                                                                         
[00:00:02.127,000] <dbg> bt_hci_core: hci_cmd_done: sync cmd released                                                                                                              
[00:00:02.127,000] <dbg> bt_hci_core: bt_tx_irq_raise: kick TX                                                                                                                     
[00:00:02.127,000] <dbg> bt_hci_core: bt_hci_cmd_send_sync: rsp 0x13dd0 opcode 0x200d len 0                                                                                        
[00:00:02.128,000] <dbg> bt_hci_core: tx_processor: TX process start                                                                                                               
[00:00:02.665,000] <dbg> bt_hci_core: bt_recv_unsafe: buf 0x13268 len 21                                                                                                           
[00:00:02.666,000] <dbg> bt_hci_core: rx_work_handler: Getting net_buf from queue                                                                                                  
[00:00:02.666,000] <dbg> bt_hci_core: rx_work_handler: buf 0x13268 type 2 len 21                                                                                                   
[00:00:02.666,000] <dbg> bt_hci_core: hci_event: event 0x3e                                                                                                                        
[00:00:02.666,000] <dbg> bt_hci_core: hci_le_meta_event: subevent 0x01                                                                                                             
[00:00:02.666,000] <dbg> bt_hci_core: le_legacy_conn_complete: status 0x00  role 0 6F:0D:E8:93:02:1E (random)                                                                      
[00:00:02.667,000] <dbg> bt_hci_core: bt_hci_le_enh_conn_complete: status 0x00  handle 257 role 0 peer 6F:0D:E8:93:02:1E (random) peer RPA 00:00:00:00:00:00                       
[00:00:02.667,000] <dbg> bt_hci_core: bt_hci_le_enh_conn_complete: local RPA 00:00:00:00:00:00                                                                                     
[00:00:02.667,000] <dbg> bt_smp: bt_smp_accept: conn 0x4738 handle 257                                                                                                             
Start Scanning                                                                                                                                                                     
[00:00:02.669,000] <dbg> bt_id: set_random_address: 03:DD:3A:E9:8C:0A                                                                                                              
[00:00:02.669,000] <dbg> bt_hci_core: bt_hci_cmd_create: opcode 0x2005 param_len 6                                                                                                 
[00:00:02.669,000] <dbg> bt_hci_core: bt_hci_cmd_create: buf 0x13dd0                                                                                                               
[00:00:02.670,000] <dbg> bt_hci_core: bt_hci_cmd_send_sync: buf 0x13dd0 opcode 0x2005 len 9                                                                                        
[00:00:02.670,000] <dbg> bt_hci_core: bt_tx_irq_raise: kick TX                                                                                                                     
[00:00:02.670,000] <dbg> bt_hci_core: bt_hci_cmd_create: opcode 0x2016 param_len 2                                                                                                 
[00:00:07.666,000] <dbg> bt_hci_core: bt_recv_unsafe: buf 0x13288 len 15                                                                                                           
ASSERTION FAIL [err == 0] @ WEST_TOPDIR/zephyr/subsys/bluetooth/host/hci_core.c:446                                                                                                
        Controller unresponsive, command opcode 0x2005 timeout with err -11                                                                                                        
[00:00:12.672,000] <err> os: r0/a1:  0x00000003  r1/a2:  0x00000000  r2/a3:  0x00000002                                                                                            
[00:00:12.672,000] <err> os: r3/a4:  0x00000003 r12/ip:  0x00000010 r14/lr:  0x0820b17b                                                                                            
[00:00:12.672,000] <err> os:  xpsr:  0x01000000                                                                                                                                    
[00:00:12.672,000] <err> os: Faulting instruction address (r15/pc): 0x0820b18a                                                                                                     
[00:00:12.672,000] <err> os: >>> ZEPHYR FATAL ERROR 3: Kernel oops on CPU 0                                                                                                        
[00:00:12.672,000] <err> os: Current thread: 0x4660 (BT RX WQ)                                                                                                                     
[00:00:12.685,000] <err> os: Halting system                                                                                                                                        

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area: Bluetooth Host Bluetooth Host (excluding BR/EDR) area: Bluetooth area: Samples Samples
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants