Skip to content

shell: telnet: Don't close the connection on ENOBUFS error #67596

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 1 commit into from
Jan 17, 2024

Conversation

mkaranki
Copy link

If there's not enough networking buffers at certain moment, they might become available later. So instead of closing connection (and failing assertation) sleep and retry. This avoid the following assertion failure when there's much of data to send:

net_pkt: Data buffer (1500) allocation failed.
net_tcp: conn: 0x20076024 packet allocation failed, len=1460
shell_telnet: Failed to send -105, shutting down
ASSERTION FAIL [err == 0] @ WEST_TOPDIR/zephyr/subsys/shell/shell_ops.c:416
os: r0/a1:  0x00000004  r1/a2:  0x000001a0  r2/a3:  0x00000004
os: r3/a4:  0x20044380 r12/ip:  0x00001958 r14/lr:  0x080c9027
os:  xpsr:  0x41000000
os: Faulting instruction address (r15/pc): 0x0811ed26
os: >>> ZEPHYR FATAL ERROR 4: Kernel panic on CPU 0
os: Current thread: 0x20045100 (shell_telnet)
os: Halting system

This is related to PR #66287 which implemented not closing the connection if write fails due to EAGAIN.

If there's not enough networking buffers at certain moment,
they might become available later. So instead of closing connection
(and failing assertation) sleep and retry. This avoid the following
assertion failure when there's much of data to send:

    net_pkt: Data buffer (1500) allocation failed.
    net_tcp: conn: 0x20076024 packet allocation failed, len=1460
    shell_telnet: Failed to send -105, shutting down
    ASSERTION FAIL [err == 0] @ .../subsys/shell/shell_ops.c:416
    os: r0/a1:  0x00000004  r1/a2:  0x000001a0  r2/a3:  0x00000004
    os: r3/a4:  0x20044380 r12/ip:  0x00001958 r14/lr:  0x080c9027
    os:  xpsr:  0x41000000
    os: Faulting instruction address (r15/pc): 0x0811ed26
    os: >>> ZEPHYR FATAL ERROR 4: Kernel panic on CPU 0
    os: Current thread: 0x20045100 (shell_telnet)
    os: Halting system

Signed-off-by: Miika Karanki <[email protected]>
@mkaranki mkaranki force-pushed the feature/telnet-ENOBUFS branch from d9ff918 to a78f44c Compare January 15, 2024 10:28
@mkaranki mkaranki marked this pull request as ready for review January 15, 2024 10:55
@zephyrbot zephyrbot added the area: Shell Shell subsystem label Jan 15, 2024
@mkaranki
Copy link
Author

@rlubos and @jukkar : could you also take a look as you've been reviewing #66287 as well?

I'm bit unsure is it ok as a principle to handle the buffer allocation problem this way as the root cause of the assertion failure this is addressing is running out of net buffers, but in our case this seems to solve the issue, nicely logging the buffer allocation error as well over the telnet if the shell logging is enabled. As the shell is often used as a debug/development tool, crashing due to the shell not being able to send data is bit inconvenient when assertions are enabled.

Copy link
Collaborator

@rlubos rlubos left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The change looks ok (running out of buffers shouldn't be the reason to close the connection). However that assert at the shell level seems concerning to me, if I understand correctly it would hit as well if the peer simply closes the connection. Perhaps we should have some connection recovery mechanism at the telnet shell backend (just thinking aloud here, of course not asking to handle it in this PR).

@jukkar
Copy link
Member

jukkar commented Jan 15, 2024

if I understand correctly it would hit as well if the peer simply closes the connection.

Also at that point, probably the tmp_cnt is uninitialized and then have some garbage value and we use that in the function.

Copy link
Member

@jukkar jukkar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. It might make sense to have some limiter here, so if the system is out-of-buffers, we do not sleep forever. Just a suggestion for future PR.

@mkaranki
Copy link
Author

However that assert at the shell level seems concerning to me, if I understand correctly it would hit as well if the peer simply closes the connection.

You're right @rlubos ! Simply doing echo x | telnet 192.168.0.5

... causes the assertion failure due to ENOTCONN error:

[00:05:40.271,000] <err> shell_telnet: Failed to send -128, shutting down
ASSERTION FAIL [err == 0] @ WEST_TOPDIR/zephyr/subsys/shell/shell_ops.c:416
[00:05:40.271,000] <err> os: r0/a1:  0x00000004  r1/a2:  0x000001a0  r2/a3:  0x00000002
[00:05:40.271,000] <err> os: r3/a4:  0x20044380 r12/ip:  0x0000312f r14/lr:  0x080c9033
[00:05:40.271,000] <err> os:  xpsr:  0x41000000
[00:05:40.271,000] <err> os: Faulting instruction address (r15/pc): 0x0811ee56
[00:05:40.271,000] <err> os: >>> ZEPHYR FATAL ERROR 4: Kernel panic on CPU 0
[00:05:40.271,000] <err> os: Current thread: 0x20045100 (shell_telnet)
[00:05:40.272,000] <err> os: Halting system

I'm also seeing another failure sometimes before the above happens:

[00:00:31.370,000] <err> shell_telnet: Failed to send -128, shutting down
ASSERTION FAIL [context] @ WEST_TOPDIR/zephyr/subsys/net/ip/net_context.c:387
	
[00:00:31.370,000] <err> os: r0/a1:  0x00000004  r1/a2:  0x00000183  r2/a3:  0x00000003
[00:00:31.370,000] <err> os: r3/a4:  0x20044380 r12/ip:  0x00007a8a r14/lr:  0x080d13a1
[00:00:31.370,000] <err> os:  xpsr:  0x41000000
[00:00:31.370,000] <err> os: Faulting instruction address (r15/pc): 0x0811ee56
[00:00:31.370,000] <err> os: >>> ZEPHYR FATAL ERROR 4: Kernel panic on CPU 0
[00:00:31.370,000] <err> os: Current thread: 0x20045100 (shell_telnet)
[00:00:31.395,000] <err> os: Halting system

... but I'm using Zephyr 3.4.0 here and there's lots changes since that in net_context.c so that might not be relevant anymore. Retesting with 3.5.0 requires a bit of work for me.

@jukkar
Copy link
Member

jukkar commented Jan 15, 2024

I am not seeing errors with latest upstream in qemu_x86. Tried with this

west build -p -b qemu_x86 samples/net/telnet/ -d ../build/telnet \
    -t run -- -DOVERLAY_CONFIG=../sockets/echo_server/overlay-e1000.conf -DCONFIG_ASSERT=y

and with this command echo x | telnet 192.0.2.1

@jukkar
Copy link
Member

jukkar commented Jan 15, 2024

But with native_sim there is an crash

west build -p -b native_sim samples/net/telnet/ -d ../build/telnet \
  -t run -- -DCONFIG_ASSERT=y -DCONFIG_NATIVE_UART_0_ON_STDINOUT=y \
  -DCONFIG_LOG_BACKEND_UART=y -DCONFIG_SHELL_LOG_BACKEND=n
*** Booting Zephyr OS build zephyr-v3.5.0-4132-gf37d150c1052 ***


uart:~$ ASSERTION FAIL [err == 0] @ WEST_TOPDIR/zephyr/subsys/shell/shell_ops.c:438
@ WEST_TOPDIR/zephyr/lib/os/assert.c:43
Exiting due to fatal error

@jukkar
Copy link
Member

jukkar commented Jan 15, 2024

I created a bug report of the crash issue in #67637

@carlescufi carlescufi merged commit 7c00472 into zephyrproject-rtos:main Jan 17, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area: Shell Shell subsystem
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants