Skip to content

Bluetooth: Mesh: Fix pb-adv link close timeout #61551

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Conversation

Andrewpini
Copy link
Collaborator

Alters the closing timout implementation of the pb-adv so that the provisioner does not close the link until the closing timeout has elapsed. Reduces the timeout from 3 to 2 seconds.
Adds minor delay in reprovision persistence test to avoid failure after changes.

@@ -50,7 +50,7 @@ LOG_MODULE_REGISTER(bt_mesh_pb_adv);

#define RETRANSMIT_TIMEOUT K_MSEC(CONFIG_BT_MESH_PB_ADV_RETRANS_TIMEOUT)
#define BUF_TIMEOUT K_MSEC(400)
#define CLOSING_TIMEOUT 3
#define CLOSING_TIMEOUT 2
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

5.3.1.4.3
A Link Close message is used to close a link. Since this message is not acknowledged, the sender shall repeat this message at least three times.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The message is sent three times. It is ensured by RETRANSMITS_LINK_CLOSE. CLOSING_TIMEOUT defines the, at least intended, wait time before the link should be closed in seconds. I think the default RETRANSMIT_TIMEOUT is 500 msec, so in testing the message is retransmitted 4-5 times.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps a counter could be used instead of timeout?

@@ -672,12 +641,40 @@ static int bearer_ctl_send_unacked(struct net_buf *buf, void *user_data)
prov_clear_tx();
k_work_reschedule(&link.prot_timer, PROTOCOL_TIMEOUT);

bt_mesh_adv_send(buf, &buf_sent_cb, user_data);
bt_mesh_adv_send(buf, NULL, user_data);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is quite weird. You do not provide callback but provide callback data. Something is actually wrong here.

Since user_data is actively used to pass link statuses or close reason after removing callbacks there is no functionality that applies user_data to the final destination. This change, seems, broke it. The functionality requires much more refactoring or should be reverted.

This is quite strange why it wasn't caught by bsim tests. Probably we need bsim tests to check provisioning adv bearer behavior.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree it is wierd. The only purpose the CB's had was to check if the link was closing and to close it if it was. Other than that I could not see another purpose for the CB's. I was not quite sure if I could remove the user_data, so I left them to be discussed here instead.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To be honest I just made an attempt to fix this with very little knowledge of the prov layers. It is quite possible that we need a major refactoring to fully resolve this.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As far as I understood it: The buf_sent_cb was only used from within bearer_ctl_send_unacked. All calls to bearer_ctl_send_unacked passes a link status as user data. Originally the user data/status was only used when passed to the close_link call in the buf_sent CB. buf_start only purpose was to ensure that buf_sent was called if an error occured during sending. When I implemented another scheme for closing the link that respected the CLOSING_TIMEOUT define, these CBs did not serve any purpose anymore and could be removed. Am I missing something?

@@ -926,11 +923,9 @@ static void prov_link_close(enum prov_bearer_link_status status)
* be restransmitted at least three times. Retransmit the LINK_CLOSE
* message until CLOSING_TIMEOUT has elapsed.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You moved CLOSING_TIMEOUT usage to another place but left a comment nonrelavant. Please keep the code clean for better maintainability.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sry, my bad 👍

@@ -1034,6 +1034,7 @@ static void test_reprovisioning_provisioner(void)

LOG_INF("Provisioning a remote device...");
ASSERT_OK(k_sem_take(&prov_sem, K_SECONDS(40)));
k_sleep(K_MSEC(200));
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please add a comment here about why this is 200 milliseconds and why this is required.
When I add some huge changes like PSA support or another settings backend those change timing significantly such kind no name delays confuse a lot. Why they were added? and why exact this digit?
Sometimes it looks like an attempt to relax test conditions to hide issues or bugs that raise from time to time. It is very difficult to maintain such kinds of tests.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will do 👍

Copy link
Collaborator

@PavelVPV PavelVPV left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What you have done is how it was working originally, but then was changed here: #36498 Do you think it won't cause the issue described in that PR?

Another question is that if we really need to make every provisioning 2 seconds longer? The Link Close message can still be missed and the result will be the same on the provisionee, but the provisioner will wait for additional 2 seconds, while it could start provisioning the next device.

Alters the closing timout implementation of the pb-adv so that the
provisioner does not close the link until the closing timeout has
elapsed. Reduces the timeout from 3 to 2 seconds.
Adds minor delay in reprovision persistence test to avoid failure
after changes.

Signed-off-by: Anders Storrø <[email protected]>
@Andrewpini Andrewpini force-pushed the fix_pb_adv_link_close_timeout branch from d1302ba to dd39c8e Compare August 17, 2023 14:12
@Andrewpini
Copy link
Collaborator Author

What you have done is how it was working originally, but then was changed here: #36498 Do you think it won't cause the issue described in that PR?

Another question is that if we really need to make every provisioning 2 seconds longer? The Link Close message can still be missed and the result will be the same on the provisionee, but the provisioner will wait for additional 2 seconds, while it could start provisioning the next device.

Hmm... I need to check. The reason I started this was because the final step of provisioning with PB-adv failed like 80% of the time in test_shell

@Andrewpini
Copy link
Collaborator Author

@PavelVPV after my latest changes I do not think that the issues from the previous PR will be relevant. If the device receives a prov_link_close the link will be closed either by the retransmission timer CB, or in the buf sent CB when the CLOSE_TIMEOUT has elapsed.

When it comes to the question of if we should do this due to increased execution time, I'm not sure. From what I have seen the prov procedure with pb-adv really stuggles to pass the finalizing messages in the procedure, causing the provisionee to retransmit PROV complete messages for the PROTOCOL_TIMEOUT duration. However, its current state does not break the functionality and is copliant with the spec since the CLOSING_TIMEOUT is not mandatory.

If we decide to keep the implementation as is I would like to clean up a bit in the code since all use of CLOSING_TIMEOUT is more or less pointless.

@Andrewpini Andrewpini requested review from PavelVPV and alxelax August 18, 2023 13:54
ctl_buf_create(LINK_CLOSE, &link.close_link.reason, 1, RETRANSMITS_LINK_CLOSE);

if (!buf) {
LOG_ERR("Failed to create link close buffer");
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we call close_link here in case of error or retry?

link.close_link.start = k_uptime_get();

bearer_ctl_send_link_close();
k_work_reschedule(&link.tx.retransmit, RETRANSMIT_TIMEOUT);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Move k_work_reschedule inside bearer_ctl_send_link_close?

net_buf_unref(buf);

return 0;
}

static void bearer_ctl_send_link_close(void)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Rename to link_close_send?

@@ -50,7 +50,7 @@ LOG_MODULE_REGISTER(bt_mesh_pb_adv);

#define RETRANSMIT_TIMEOUT K_MSEC(CONFIG_BT_MESH_PB_ADV_RETRANS_TIMEOUT)
#define BUF_TIMEOUT K_MSEC(400)
#define CLOSING_TIMEOUT 3
#define CLOSING_TIMEOUT 2
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps a counter could be used instead of timeout?

}

prov_clear_tx();
k_work_reschedule(&link.prot_timer, PROTOCOL_TIMEOUT);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should it be started only once? And is there any need in it if protocol_timeout doesn't do anything if link is not ADV_LINK_ACTIVE?

@Andrewpini
Copy link
Collaborator Author

Something has changed between d11c0a(The main commit where I discovered this issue) and the current main commit (2218e8). Now it seems the pb-adv is able to complete the full provisioning procedure without failure. I have tried to investigate what the reason is but have not found it this far.

Should we just close this PR then @alxelax & @PavelVPV? Or is these changes still of interest?

@Andrewpini Andrewpini requested a review from PavelVPV August 21, 2023 13:05
@PavelVPV
Copy link
Collaborator

Something has changed between d11c0a(The main commit where I discovered this issue) and the current main commit (2218e8). Now it seems the pb-adv is able to complete the full provisioning procedure without failure. I have tried to investigate what the reason is but have not found it this far.

Should we just close this PR then @alxelax & @PavelVPV? Or is these changes still of interest?

Lets postpone changing the code until we are certain what is an issue.

@Andrewpini Andrewpini closed this Aug 21, 2023
@alxelax
Copy link
Collaborator

alxelax commented Aug 21, 2023

I'd suggest covering this in bsim tests. Seems pb-adv is not covered explicitly. Only over provisioning tests.

@mrfuchs
Copy link
Collaborator

mrfuchs commented Aug 29, 2023

For the record, 2fe4493 seems to have fixed this issue (for me).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants