Skip to content

Regression: nNRF52840 HW dies at 8 minutes 20 seconds in various samples using IPSP #11744

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
mike-scott opened this issue Nov 29, 2018 · 12 comments
Assignees
Labels
area: Networking bug The issue is a bug, or the PR is fixing a bug platform: nRF Nordic nRFx priority: high High impact/importance bug Regression Something, which was working, does not anymore
Milestone

Comments

@mike-scott
Copy link
Contributor

mike-scott commented Nov 29, 2018

Describe the bug
During automated testing of the LwM2M client in master branch, a regression was noted on both reel_board and nrf52840_pca10056. The board dies after 8 minutes and 20 seconds. Other samples which aren't public also displayed the same behavior. One was an MQTT and HTTP application. All samples tested use IPSP network connection via BLE.

Bisection led to this commit:
baea224 kernel: Always set clock expiry with sync with timeout module

Prior to this commit, the samples runs normally.

To Reproduce
Steps to reproduce the behavior:

  1. cd samples/net/lwm2m_client
  2. mkdir build; cd build
  3. cmake -DBOARD=reel_board -DOVERLAY_CONFIG=overlay-bt.conf ..
  4. make
  5. make flash
  6. board should be connected via IPSP (timeout might happen during init due to using generic IPv6 addresses -- doesn't affect the outcome)
    6a. NOTE: if user wants connection to actually work, they may need to alter the configuration of their IPSP gateway to use the 2001:db8:: subnet.
    6b. Outcome is the same whether or not IPSP networking actually works
  7. wait for 8 minutes and 20 seconds, board will stop responding

Expected behavior
Board / sample should run longer than 8 minutes 20 seconds.

Impact
Unacceptable board behavior

Screenshots or console output

uart:~$ ***** Booting Zephyr OS zephyr-v1.13.0-2173-gbaea22407d *****
[00:00:00.007,110] <dbg> net_lwm2m_engine.lwm2m_engine_init: LWM2M engine thread started
uart:~$ [00:00:00.007,110] <dbg> net_lwm2m_obj_security.security_create: Create LWM2M security instance: 0
uart:~$ [00:00:00.007,141] <dbg> net_lwm2m_obj_server.server_create: Create LWM2M server instance: 0
uart:~$ [00:00:00.007,171] <dbg> net_lwm2m_obj_device.device_create: Create LWM2M device instance: 0
uart:~$ [00:00:00.007,171] <dbg> net_lwm2m_obj_firmware.firmware_create: Create LWM2M firmware instance: 0
uart:~$ [00:00:00.369,354] <inf> net_config.net_config_init: Initializing network
uart:~$ [00:00:00.369,720] <wrn> net_if.net_if_send_data: iface 0x2000bbe0 is down
uart:~$ [00:00:00.369,750] <err> net_if.join_mcast_allnodes: Cannot join all nodes address ff02::1 (-5)
uart:~$ uart:~$ [00:00:17.074,615] <dbg> net_lwm2m_engine.lwm2m_engine_set: path:3/0/10, value:0x20009dec, len:4
uart:~$ [00:00:17.074,676] <dbg> net_lwm2m_engine.lwm2m_engine_set: path:3/0/20, value:0x20009def, len:1
uart:~$ [00:00:17.074,707] <dbg> net_lwm2m_engine.lwm2m_engine_set: path:3/0/21, value:0x20009dec, len:4
uart:~$ [00:00:17.074,737] <dbg> net_lwm2m_engine.lwm2m_engine_create_obj_inst: path:3303/0
uart:~$ [00:00:17.074,768] <dbg> net_ipso_temp_sensor.temp_sensor_create: Create IPSO Temperature Sensor instance: 0
uart:~$ [00:00:17.074,768] <dbg> net_lwm2m_engine.lwm2m_engine_set: path:3303/0/5700, value:0x20009e00, len:8
uart:~$ [00:00:17.074,798] <dbg> net_lwm2m_engine.lwm2m_engine_create_obj_inst: path:3311/0
uart:~$ [00:00:17.074,829] <dbg> net_ipso_light_control.light_control_create: Create IPSO Light Control instance: 0
uart:~$ uart:~$ [00:00:17.165,161] <inf> net_lwm2m_rd_client.lwm2m_rd_client_start: LWM2M Client: nrf52840_pca10056
uart:~$ [00:00:17.373,901] <inf> net_lwm2m_rd_client.sm_do_init: RD Client started with endpoint 'nrf52840_pca10056' and client lifetime 0
uart:~$ [00:00:17.883,758] <dbg> net_lwm2m_rd_client.sm_send_registration: registration sent [2001:db8::2]
uart:~$ uart:~$ [00:00:20.237,945] <dbg> net_lwm2m_engine.retransmit_request: Resending message: 0x20002b90
uart:~$ uart:~$ [00:00:24.928,192] <dbg> net_lwm2m_engine.retransmit_request: Resending message: 0x20002b90
uart:~$ uart:~$ [00:00:34.298,706] <dbg> net_lwm2m_engine.retransmit_request: Resending message: 0x20002b90
uart:~$ uart:~$ [00:00:53.029,754] <wrn> net_lwm2m_rd_client.do_registration_timeout_cb: Registration Timeout
uart:~$ [00:00:53.029,754] <dbg> net_lwm2m_client_app.rd_client_event: Disconnected
uart:~$ [00:00:53.029,754] <dbg> net_lwm2m_client_app.rd_client_event: Registration failure!
uart:~$ uart:~$ [00:00:53.508,758] <inf> net_lwm2m_rd_client.sm_do_init: RD Client started with endpoint 'nrf52840_pca10056' and client lifetime 30
uart:~$ [00:00:54.018,676] <dbg> net_lwm2m_rd_client.sm_send_registration: registration sent [2001:db8::2]
uart:~$ uart:~$ [00:00:56.372,802] <dbg> net_lwm2m_engine.retransmit_request: Resending message: 0x20002b90
uart:~$ uart:~$ [00:01:01.063,049] <dbg> net_lwm2m_engine.retransmit_request: Resending message: 0x20002b90
uart:~$ uart:~$ [00:01:10.433,563] <dbg> net_lwm2m_engine.retransmit_request: Resending message: 0x20002b90
uart:~$ uart:~$ [00:01:29.164,611] <wrn> net_lwm2m_rd_client.do_registration_timeout_cb: Registration Timeout
uart:~$ [00:01:29.164,611] <dbg> net_lwm2m_client_app.rd_client_event: Disconnected
uart:~$ [00:01:29.164,611] <dbg> net_lwm2m_client_app.rd_client_event: Registration failure!
uart:~$ uart:~$ [00:01:29.676,605] <inf> net_lwm2m_rd_client.sm_do_init: RD Client started with endpoint 'nrf52840_pca10056' and client lifetime 30
uart:~$ [00:01:30.183,380] <dbg> net_lwm2m_rd_client.sm_send_registration: registration sent [2001:db8::2]
uart:~$ uart:~$ [00:01:32.537,597] <dbg> net_lwm2m_engine.retransmit_request: Resending message: 0x20002b90
uart:~$ uart:~$ [00:01:37.227,844] <dbg> net_lwm2m_engine.retransmit_request: Resending message: 0x20002b90
uart:~$ uart:~$ [00:01:46.598,358] <dbg> net_lwm2m_engine.retransmit_request: Resending message: 0x20002b90
uart:~$ uart:~$ [00:02:05.329,406] <wrn> net_lwm2m_rd_client.do_registration_timeout_cb: Registration Timeout
uart:~$ [00:02:05.329,406] <dbg> net_lwm2m_client_app.rd_client_event: Disconnected
uart:~$ [00:02:05.329,406] <dbg> net_lwm2m_client_app.rd_client_event: Registration failure!
uart:~$ [00:02:05.808,410] <inf> net_lwm2m_rd_client.sm_do_init: RD Client started with endpoint 'nrf52840_pca10056' and client lifetime 30
uart:~$ uart:~$ [00:02:06.318,267] <dbg> net_lwm2m_rd_client.sm_send_registration: registration sent [2001:db8::2]
uart:~$ uart:~$ [00:02:08.672,454] <dbg> net_lwm2m_engine.retransmit_request: Resending message: 0x20002b90
uart:~$ uart:~$ [00:02:13.362,701] <dbg> net_lwm2m_engine.retransmit_request: Resending message: 0x20002b90
uart:~$ uart:~$ [00:02:22.733,215] <dbg> net_lwm2m_engine.retransmit_request: Resending message: 0x20002b90
uart:~$ uart:~$ [00:02:41.464,263] <wrn> net_lwm2m_rd_client.do_registration_timeout_cb: Registration Timeout
uart:~$ [00:02:41.464,263] <dbg> net_lwm2m_client_app.rd_client_event: Disconnected
uart:~$ [00:02:41.464,263] <dbg> net_lwm2m_client_app.rd_client_event: Registration failure!
uart:~$ [00:02:41.943,267] <inf> net_lwm2m_rd_client.sm_do_init: RD Client started with endpoint 'nrf52840_pca10056' and client lifetime 30
uart:~$ uart:~$ [00:02:42.453,094] <dbg> net_lwm2m_rd_client.sm_send_registration: registration sent [2001:db8::2]
uart:~$ uart:~$ [00:02:44.807,312] <dbg> net_lwm2m_engine.retransmit_request: Resending message: 0x20002b90
uart:~$ uart:~$ [00:02:49.497,558] <dbg> net_lwm2m_engine.retransmit_request: Resending message: 0x20002b90
uart:~$ uart:~$ [00:02:58.868,072] <dbg> net_lwm2m_engine.retransmit_request: Resending message: 0x20002b90
uart:~$ uart:~$ [00:03:17.599,121] <wrn> net_lwm2m_rd_client.do_registration_timeout_cb: Registration Timeout
uart:~$ [00:03:17.599,121] <dbg> net_lwm2m_client_app.rd_client_event: Disconnected
uart:~$ [00:03:17.599,121] <dbg> net_lwm2m_client_app.rd_client_event: Registration failure!
uart:~$ [00:03:18.078,125] <inf> net_lwm2m_rd_client.sm_do_init: RD Client started with endpoint 'nrf52840_pca10056' and client lifetime 30
uart:~$ uart:~$ [00:03:18.587,982] <dbg> net_lwm2m_rd_client.sm_send_registration: registration sent [2001:db8::2]
uart:~$ uart:~$ [00:03:20.942,169] <dbg> net_lwm2m_engine.retransmit_request: Resending message: 0x20002b90
uart:~$ uart:~$ [00:03:25.632,415] <dbg> net_lwm2m_engine.retransmit_request: Resending message: 0x20002b90
uart:~$ uart:~$ [00:03:35.002,929] <dbg> net_lwm2m_engine.retransmit_request: Resending message: 0x20002b90
uart:~$ uart:~$ [00:03:53.733,978] <wrn> net_lwm2m_rd_client.do_registration_timeout_cb: Registration Timeout
uart:~$ [00:03:53.733,978] <dbg> net_lwm2m_client_app.rd_client_event: Disconnected
uart:~$ [00:03:53.733,978] <dbg> net_lwm2m_client_app.rd_client_event: Registration failure!
uart:~$ uart:~$ [00:03:54.212,982] <inf> net_lwm2m_rd_client.sm_do_init: RD Client started with endpoint 'nrf52840_pca10056' and client lifetime 30
uart:~$ [00:03:54.722,839] <dbg> net_lwm2m_rd_client.sm_send_registration: registration sent [2001:db8::2]
uart:~$ uart:~$ [00:03:57.077,026] <dbg> net_lwm2m_engine.retransmit_request: Resending message: 0x20002b90
uart:~$ uart:~$ [00:04:01.767,272] <dbg> net_lwm2m_engine.retransmit_request: Resending message: 0x20002b90
uart:~$ uart:~$ [00:04:11.137,786] <dbg> net_lwm2m_engine.retransmit_request: Resending message: 0x20002b90
uart:~$ uart:~$ [00:04:29.868,835] <wrn> net_lwm2m_rd_client.do_registration_timeout_cb: Registration Timeout
uart:~$ [00:04:29.868,835] <dbg> net_lwm2m_client_app.rd_client_event: Disconnected
uart:~$ [00:04:29.868,835] <dbg> net_lwm2m_client_app.rd_client_event: Registration failure!
uart:~$ uart:~$ [00:04:30.347,839] <inf> net_lwm2m_rd_client.sm_do_init: RD Client started with endpoint 'nrf52840_pca10056' and client lifetime 30
uart:~$ [00:04:30.857,666] <dbg> net_lwm2m_rd_client.sm_send_registration: registration sent [2001:db8::2]
uart:~$ uart:~$ [00:04:33.211,883] <dbg> net_lwm2m_engine.retransmit_request: Resending message: 0x20002b90
uart:~$ uart:~$ [00:04:37.902,130] <dbg> net_lwm2m_engine.retransmit_request: Resending message: 0x20002b90
uart:~$ uart:~$ [00:04:47.272,644] <dbg> net_lwm2m_engine.retransmit_request: Resending message: 0x20002b90
uart:~$ uart:~$ [00:05:06.003,692] <wrn> net_lwm2m_rd_client.do_registration_timeout_cb: Registration Timeout
uart:~$ [00:05:06.003,692] <dbg> net_lwm2m_client_app.rd_client_event: Disconnected
uart:~$ [00:05:06.003,692] <dbg> net_lwm2m_client_app.rd_client_event: Registration failure!
uart:~$ uart:~$ [00:05:06.482,696] <inf> net_lwm2m_rd_client.sm_do_init: RD Client started with endpoint 'nrf52840_pca10056' and client lifetime 30
uart:~$ [00:05:06.992,675] <dbg> net_lwm2m_rd_client.sm_send_registration: registration sent [2001:db8::2]
uart:~$ uart:~$ [00:05:09.346,740] <dbg> net_lwm2m_engine.retransmit_request: Resending message: 0x20002b90
uart:~$ uart:~$ [00:05:14.036,987] <dbg> net_lwm2m_engine.retransmit_request: Resending message: 0x20002b90
uart:~$ uart:~$ [00:05:23.407,501] <dbg> net_lwm2m_engine.retransmit_request: Resending message: 0x20002b90
uart:~$ uart:~$ [00:05:42.138,549] <wrn> net_lwm2m_rd_client.do_registration_timeout_cb: Registration Timeout
uart:~$ [00:05:42.138,549] <dbg> net_lwm2m_client_app.rd_client_event: Disconnected
uart:~$ [00:05:42.138,549] <dbg> net_lwm2m_client_app.rd_client_event: Registration failure!
uart:~$ [00:05:42.617,553] <inf> net_lwm2m_rd_client.sm_do_init: RD Client started with endpoint 'nrf52840_pca10056' and client lifetime 30
uart:~$ uart:~$ [00:05:43.127,441] <dbg> net_lwm2m_rd_client.sm_send_registration: registration sent [2001:db8::2]
uart:~$ uart:~$ [00:05:45.481,597] <dbg> net_lwm2m_engine.retransmit_request: Resending message: 0x20002b90
uart:~$ uart:~$ [00:05:50.171,844] <dbg> net_lwm2m_engine.retransmit_request: Resending message: 0x20002b90
uart:~$ uart:~$ [00:05:59.542,358] <dbg> net_lwm2m_engine.retransmit_request: Resending message: 0x20002b90
uart:~$ uart:~$ [00:06:18.273,406] <wrn> net_lwm2m_rd_client.do_registration_timeout_cb: Registration Timeout
uart:~$ [00:06:18.273,406] <dbg> net_lwm2m_client_app.rd_client_event: Disconnected
uart:~$ [00:06:18.273,406] <dbg> net_lwm2m_client_app.rd_client_event: Registration failure!
uart:~$ [00:06:18.762,390] <inf> net_lwm2m_rd_client.sm_do_init: RD Client started with endpoint 'nrf52840_pca10056' and client lifetime 30
uart:~$ uart:~$ [00:06:19.272,247] <dbg> net_lwm2m_rd_client.sm_send_registration: registration sent [2001:db8::2]
uart:~$ uart:~$ [00:06:21.626,434] <dbg> net_lwm2m_engine.retransmit_request: Resending message: 0x20002b90
uart:~$ uart:~$ [00:06:26.316,680] <dbg> net_lwm2m_engine.retransmit_request: Resending message: 0x20002b90
uart:~$ uart:~$ [00:06:35.687,194] <dbg> net_lwm2m_engine.retransmit_request: Resending message: 0x20002b90
uart:~$ uart:~$ [00:06:54.418,243] <wrn> net_lwm2m_rd_client.do_registration_timeout_cb: Registration Timeout
uart:~$ [00:06:54.418,243] <dbg> net_lwm2m_client_app.rd_client_event: Disconnected
uart:~$ [00:06:54.418,243] <dbg> net_lwm2m_client_app.rd_client_event: Registration failure!
uart:~$ [00:06:54.897,247] <inf> net_lwm2m_rd_client.sm_do_init: RD Client started with endpoint 'nrf52840_pca10056' and client lifetime 30
uart:~$ uart:~$ [00:06:55.407,104] <dbg> net_lwm2m_rd_client.sm_send_registration: registration sent [2001:db8::2]
uart:~$ uart:~$ [00:06:57.761,291] <dbg> net_lwm2m_engine.retransmit_request: Resending message: 0x20002b90
uart:~$ uart:~$ [00:07:02.451,538] <dbg> net_lwm2m_engine.retransmit_request: Resending message: 0x20002b90
uart:~$ uart:~$ [00:07:11.822,052] <dbg> net_lwm2m_engine.retransmit_request: Resending message: 0x20002b90
uart:~$ uart:~$ [00:07:30.553,100] <wrn> net_lwm2m_rd_client.do_registration_timeout_cb: Registration Timeout
uart:~$ [00:07:30.553,100] <dbg> net_lwm2m_client_app.rd_client_event: Disconnected
uart:~$ [00:07:30.553,100] <dbg> net_lwm2m_client_app.rd_client_event: Registration failure!
uart:~$ [00:07:31.032,104] <inf> net_lwm2m_rd_client.sm_do_init: RD Client started with endpoint 'nrf52840_pca10056' and client lifetime 30
uart:~$ uart:~$ [00:07:31.541,931] <dbg> net_lwm2m_rd_client.sm_send_registration: registration sent [2001:db8::2]
uart:~$ uart:~$ [00:07:33.896,148] <dbg> net_lwm2m_engine.retransmit_request: Resending message: 0x20002b90
uart:~$ uart:~$ [00:07:38.586,395] <dbg> net_lwm2m_engine.retransmit_request: Resending message: 0x20002b90
uart:~$ uart:~$ [00:07:47.956,909] <dbg> net_lwm2m_engine.retransmit_request: Resending message: 0x20002b90
uart:~$ uart:~$ [00:08:06.687,957] <wrn> net_lwm2m_rd_client.do_registration_timeout_cb: Registration Timeout
uart:~$ [00:08:06.687,957] <dbg> net_lwm2m_client_app.rd_client_event: Disconnected
uart:~$ [00:08:06.687,957] <dbg> net_lwm2m_client_app.rd_client_event: Registration failure!
uart:~$ uart:~$ [00:08:07.166,961] <inf> net_lwm2m_rd_client.sm_do_init: RD Client started with endpoint 'nrf52840_pca10056' and client lifetime 30
uart:~$ [00:08:07.676,788] <dbg> net_lwm2m_rd_client.sm_send_registration: registration sent [2001:db8::2]
uart:~$ uart:~$ [00:08:10.031,005] <dbg> net_lwm2m_engine.retransmit_request: Resending message: 0x20002b90
uart:~$ uart:~$ [00:08:14.721,252] <dbg> net_lwm2m_engine.retransmit_request: Resending message: 0x20002b90
uart:~$ uart:~$ [00:08:24.091,766] <dbg> net_lwm2m_engine.retransmit_request: Resending message: 0x20002b90
uart:~$ uart:~$ 

NOTE: Logging just stops here. There is no OOPS.

Environment (please complete the following information):

  • OS: Linux
  • Distribution: Ubuntu 18.04
  • Toolchain: Zephyr SDK
  • Commit SHA: baea224
@mike-scott mike-scott added Regression Something, which was working, does not anymore bug The issue is a bug, or the PR is fixing a bug area: Networking platform: nRF Nordic nRFx labels Nov 29, 2018
@mike-scott mike-scott added this to the v1.14.0 milestone Nov 29, 2018
@carlescufi carlescufi added the priority: high High impact/importance bug label Nov 29, 2018
@carlescufi
Copy link
Member

Assigning to @pdunaj, @andyross and @andrewboie since this looks like a commit to the kernel code.

@mike-scott
Copy link
Contributor Author

mike-scott commented Nov 29, 2018

Updated w/ full log from nrf52840_pca10056 running lwm2m_client sample from master with no changes.

It obviously never connects to LwM2M server due to mismatching network subnet issues w/ the default overlay-bt.conf vs. the gateway's IPSP subnet. But the outcome is the same.

@andyross
Copy link
Collaborator

Huh... that patch fixed a real bug, so it can't just be reverted. This is for sure a rollover condition. This is a 16kHz clock, right? So a signed rollover is going to happen at 2^23 bits, or just about exactly eight and a half minutes. I'll take a look, though @pdunaj is likely to see it first.

Note that I have another rework in progress of this area to prevent the "set timeout on idle" behavior entirely, which is essentially a parallel way to fix the original bug (with rapidly/repeatedly set timeouts bumping the next tick forward). If we can't find the rollover here, we could presumably revert this one once the other fix is ready. If, y'know, it doesn't itself introduce more bugs.

@carlescufi
Copy link
Member

@andyross I think this is a 32KHz clock

@aurel32
Copy link
Collaborator

aurel32 commented Nov 29, 2018

I guess this bug is a duplicate of #11694.

@andyross
Copy link
Collaborator

Hah, even better as it can be an unsigned overflow.

Oh! And I see you literally just pushed a patch. Yeah, that looks like it exactly.

@andyross
Copy link
Collaborator

(Though AFAICT that bug predates @pdunaj 's patch mentioned above. @mike-scott are you sure about that bisection?)

@mike-scott
Copy link
Contributor Author

@andyross Absolutely. I spent quite some time working through the commits. And this bug reproduces perfectly every time.

@andyross
Copy link
Collaborator

Then either we have another bug lurking, or that patch was somehow able to trigger the rollover more reliably in your test in some subtle way that I'm not able to see. The patch in #11747 for sure should fix bugs just like this though.

@mike-scott
Copy link
Contributor Author

Yep, I tested #11747 just now and it fixes the issue I was seeing.

@ntavish
Copy link
Contributor

ntavish commented Nov 30, 2018

I saw this with MQTT sample also. Logging completely stops, but if after it stops, you connect BLE and do IPSP stuff, the networking is still working, can ping to the device, and I am still getting stack init logs like prio recv thread stack (real size 448): unused 188 usage 260 / 448 (58 %).

@pdunaj
Copy link
Collaborator

pdunaj commented Nov 30, 2018

@mike-scott , @andyross , bug that was fixed by the commit mentioned caused delayed work to be not executed. It could be that in the code there was something that was simply not running and now it is allowed to execute and behavior changes. If it is so it should be relatively simple to track down.

... no it cannot be that. I checked the comment again and the PR mentioned above. It looks like an overflow issue on the counter.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area: Networking bug The issue is a bug, or the PR is fixing a bug platform: nRF Nordic nRFx priority: high High impact/importance bug Regression Something, which was working, does not anymore
Projects
None yet
Development

No branches or pull requests

7 participants