
drivers: serial: stm32: Data loss & possible hardfault #76799


Open
HelligeChris opened this issue Aug 7, 2024 · 11 comments
Assignees
Labels
bug The issue is a bug, or the PR is fixing a bug platform: STM32 ST Micro STM32 priority: medium Medium impact/importance bug

Comments

@HelligeChris

Describe the bug
The async_user_callback can be triggered from two places: from the DMA transfer-complete interrupt generated by the DMA, and from a k_work queue timeout.
The timeout does not run in ISR context, which means the DMA transfer-complete event can interrupt it, resulting in out-of-sequence data and a DMA buffer offset and length that can change while the timeout is still using those variables.

This can be seen in the call stack; note the timeout is mid-copy (memcpy) when the interrupt triggers:

[screenshot: call stack showing the DMA ISR preempting the timeout during memcpy]

To Reproduce
We've used this callback handler to debug the issue, simply by setting a breakpoint at LOG_ERR("UART FAULT"); the variables bad_message_last_buffer_ptr and expected_new_offset need to be stored outside the callback.
From what I can see this is a race condition: for example, if the timeout triggers when the DMA buffer is almost filled (missing one byte or so) and the last bytes then arrive, the DMA transfer-complete interrupt fires and interrupts the timeout.

static void dma_tx_done_callback(const struct device *dev, struct uart_event *evt, void *user_data) {
        auto *data  = static_cast<UART::dma_callback_data *>(user_data);
        int32_t err = 0;

        switch (evt->type) {
            case UART_RX_RDY: {
                IO::Message message;

                uint8_t *dataLocation = &evt->data.rx.buf[evt->data.rx.offset];

                if (data->bad_message_last_buffer_ptr != evt->data.rx.buf) {
                    data->expected_new_offset = 0U;
                }
                data->bad_message_last_buffer_ptr = evt->data.rx.buf;

                if (data->expected_new_offset != evt->data.rx.offset) {
                    LOG_ERR("UART Fault");
                } else {
                    data->expected_new_offset += evt->data.rx.len;

                    memcpy(message.data.begin(), dataLocation, evt->data.rx.len);
                    message.length = evt->data.rx.len;

                    err = k_msgq_put(&data->recv_buffer, &message, K_NO_WAIT);
                    if (err != 0) {
                        data->uart->rxError = err;
                    }
                }

                err = k_poll_signal_raise(&data->recv_signal, 1U);
                if (err != 0) {
                    data->uart->rxError = err;
                }

            } break;

// Rest of the callback...

Expected behavior
The timeout cannot be interrupted by DMA transfer complete.

Impact
For now we've had to patch the uart_stm32.c driver with an irq_lock/irq_unlock at the start and end of the uart_stm32_async_rx_timeout function:

static void uart_stm32_async_rx_timeout(struct k_work *work)
{
	// Added to stop the DMA transfer-complete interrupt from interrupting while this is running,
	// since they share data.
	unsigned int key = irq_lock();

	struct k_work_delayable *dwork = k_work_delayable_from_work(work);
	struct uart_dma_stream *rx_stream =
		CONTAINER_OF(dwork, struct uart_dma_stream, timeout_work);
	struct uart_stm32_data *data = CONTAINER_OF(rx_stream, struct uart_stm32_data, dma_rx);
	const struct device *dev = data->uart_dev;

	LOG_DBG("rx timeout");

	if (data->dma_rx.counter == data->dma_rx.buffer_length) {
		uart_stm32_async_rx_disable(dev);
	} else {
		uart_stm32_dma_rx_flush(dev);
	}
 
	irq_unlock(key);
}

However, without this the driver is effectively broken: it can lead to hard faults (we've seen event->data.rx.offset reach the maximum value, which causes a hard fault when used with memcpy) or simply lost data.

Environment (please complete the following information):

  • OS: Linux
  • Toolchain: GCC 12.3.1
  • Zephyr: 3.6.0

Additional context
This was tested on the STM32H743

@HelligeChris HelligeChris added the bug The issue is a bug, or the PR is fixing a bug label Aug 7, 2024
@nordicjm nordicjm added the platform: STM32 ST Micro STM32 label Aug 8, 2024
@erwango erwango added the priority: low Low impact/importance bug label Aug 8, 2024
@erwango
Member

erwango commented Aug 19, 2024

@djiatsaf-st ^^

@djiatsaf-st
Collaborator

Hi @HelligeChris ,
Thank you for your contributions.

I looked into this on my side and have some suggestions that I hope will be useful to you.

First, I checked whether a hard fault can occur when the offset reaches the maximum in combination with memcpy. I implemented a test reproducing a similar situation and then checked the status of the CFSR (Configurable Fault Status Register); its value was 0x00, which means no hard fault of any type occurred. You can also use the fault analyzer in STM32CubeIDE; the result will be the same.

However, rather than using memcpy(), you can implement your own copy function, and there you will see the results.

I used this function instead of memcpy() for example:

void copyArray(uint8_t *src, uint8_t *dest, uint8_t offset, uint8_t size)
{
    for (int i = offset; i < size; i++) {
        *(dest + i) = *(src + i);
    }
}

(You can modify it to suit your needs if necessary)

As for the data loss and the race condition, I could not fully reproduce your situation; I would need a little more information on the configuration and the use case that led you to raise this issue.

Also, I tested your patch: I added your changes that block interrupts, then checked whether data arrived in the good order you expect. For the moment I did not see the behavior you describe; communication was rather "normal", as usual.

From my analysis, I think this is simply how the Zephyr asynchronous serial API behaves (it does not appear to be a straight FIFO buffer), so a mechanism or function that properly processes the data from the buffer would be necessary.

In Zephyr there is a special mechanism for receiving data continuously in case of overflow (event->data.rx.offset reaching the maximum), which could also be a possible solution to the problem. The driver issues a request (uart event type UART_RX_BUF_REQUEST) for a second rx buffer.

Here is the link :

https://docs.zephyrproject.org/latest/doxygen/html/group__uart__async.html#ggaf0c9513cbafa36d86b4c83f2b6a03dcda0d250f372702526f1bce6d4dfe166abe

We will have a configuration a bit like this:

case UART_RX_BUF_REQUEST:
    uart_rx_buf_rsp(dev, rx_second_buffer, sizeof(rx_second_buffer));
    break;

The global array rx_second_buffer will contain the data history for the defined size.

@HelligeChris
Author

Hi @djiatsaf-st,

The core issue is the race condition between the timeout (uart_stm32_async_rx_timeout) and the actual DMA interrupt (dma_stm32_irq_handler in dma_stm32.c, which calls uart_stm32_dma_rx_cb in uart_stm32.c). Since both callbacks modify data->dma_rx.counter, depending on when (and whether) the DMA interrupt preempts the timeout, data->rx.len can go "negative"; since the type is size_t, it wraps to 4,294,967,295. This is what we've seen lead to hard faults.

First, I checked whether a hard fault can occur when the offset reaches the maximum in combination with memcpy. I implemented a test reproducing a similar situation and then checked the status of the CFSR (Configurable Fault Status Register); its value was 0x00, which means no hard fault of any type occurred. You can also use the fault analyzer in STM32CubeIDE; the result will be the same.

By max do you mean the size of the buffer or the size_t max size?

As for the data loss and the race condition, I could not fully reproduce your situation; I would need a little more information on the configuration and the use case that led you to raise this issue.

In our application we have two different controllers with serial communication between them, where one acts as a sensor while the other periodically and asynchronously gets the newest sensor value. The UART rx is enabled like this:

rc = uart_rx_enable(
                this->device,
                this->rx_buffer.at(this->cb_data.rx_buffer_chosen).begin(),
                this->rx_buffer.at(this->cb_data.rx_buffer_chosen).max_size(),
                1000
            );

So, we expect the driver to time out after 1 ms with no new data received. The problem occurs when the timeout is triggered while the DMA rx buffer is almost full: while the timeout callback is running, more bytes are received and the buffer fills completely, triggering the DMA interrupt. Since the DMA interrupt is an ISR, the timeout callback is "paused" while the ISR runs, then continues where it left off. This leads to data received out of order / not in sequence.

Also, I tested your patch: I added your changes that block interrupts, then checked whether data arrived in the good order you expect. For the moment I did not see the behavior you describe; communication was rather "normal", as usual.

Just to be clear, by "normal" do you mean data is received sequentially? Since this is also the operation I expect after applying the patch.

In Zephyr there is a special mechanism for receiving data continuously in case of overflow (event->data.rx.offset reaching the maximum), which could also be a possible solution to the problem. The driver issues a request (uart event type UART_RX_BUF_REQUEST) for a second rx buffer.

I believe we already use this mechanism in our use case, here's our complete callback:

    static void dma_tx_done_callback(const struct device *dev, struct uart_event *evt, void *user_data) {
        auto *data  = static_cast<UART::dma_callback_data *>(user_data);
        int32_t err = 0;

        switch (evt->type) {
            case UART_RX_RDY: {
                IO::Message message;

                uint8_t *dataLocation = &evt->data.rx.buf[evt->data.rx.offset];

                memcpy(message.data.begin(), dataLocation, evt->data.rx.len);
                message.length = evt->data.rx.len;

                err = k_msgq_put(&data->recv_buffer, &message, K_NO_WAIT);
                if (err != 0) {
                    data->uart->rxError = err;
                }

                err = k_poll_signal_raise(&data->recv_signal, 1U);
                if (err != 0) {
                    data->uart->rxError = err;
                }
            } break;
            case UART_TX_ABORTED:
                [[fallthrough]];
            case UART_TX_DONE: {
                k_sem_give(&data->tx_sem);
            } break;
            case UART_RX_BUF_REQUEST: {
                data->uart->next_dma_buffer();
                err = uart_rx_buf_rsp(
                    dev,
                    data->uart->rx_buffer.at(data->rx_buffer_chosen).begin(),
                    data->uart->rx_buffer.at(data->rx_buffer_chosen).max_size()
                );
                if (err == -EACCES) {
                    data->uart->rx_enable(true);
                }
            } break;
            case UART_RX_BUF_RELEASED:
                break;
            case UART_RX_DISABLED: {
                if (data->uart->get_rx_enabled()) {
                    data->uart->rx_enable(true);
                }
            } break;
            default:
                break;
        }
    }

@djiatsaf-st
Collaborator

Thanks for your answers

By max do you mean the size of the buffer or the size_t max size?

I mean the buffer size.

Just to be clear, by "normal" do you mean data is received sequentially? Since this is also the operation I expect after applying the patch.

No, "normal" does not mean the data is received sequentially; that's why I put normal in quotes. By "normal" I meant that, whenever I use UART communication on Zephyr, I have always had to do some post-processing on the buffer data to get the right values.

For the rest, I've taken note, and I understand the use case and the issue well.

I need to run some tests on my side to check that the patch does not impact the current state of the driver.

If I have other questions I will come back to you.

@erwango erwango assigned djiatsaf-st and unassigned erwango Sep 18, 2024

This issue has been marked as stale because it has been open (more than) 60 days with no activity. Remove the stale label or add a comment saying that you would like to have the label removed otherwise this issue will automatically be closed in 14 days. Note, that you can always re-open a closed issue at any time.

@github-actions github-actions bot added the Stale label Nov 18, 2024
@github-actions github-actions bot closed this as not planned Won't fix, can't repro, duplicate, stale Dec 2, 2024
@vestom
Contributor

vestom commented Dec 2, 2024

Please do not close this issue if it has not been fixed in the newest Zephyr. This issue is real and causes data loss/mangling when using the STM32 UART in DMA mode. Until it is fixed, we will have to manually apply the fix to our production code for each Zephyr update.

The fix is literally a two-line patch of irq_lock()/irq_unlock() as shown above. The need for it can also be seen by inspecting the code and spotting the race between the DMA transfer-complete interrupt and the k_work queue timeout. Other async functions in the driver already have similar irq_lock()/irq_unlock() protection.

@erwango erwango reopened this Dec 2, 2024
@github-actions github-actions bot removed the Stale label Dec 3, 2024

github-actions bot commented Feb 2, 2025

This issue has been marked as stale because it has been open (more than) 60 days with no activity. Remove the stale label or add a comment saying that you would like to have the label removed otherwise this issue will automatically be closed in 14 days. Note, that you can always re-open a closed issue at any time.

@github-actions github-actions bot added the Stale label Feb 2, 2025
@github-actions github-actions bot closed this as not planned Won't fix, can't repro, duplicate, stale Feb 16, 2025
@erwango erwango reopened this Feb 17, 2025
@erwango erwango removed the Stale label Feb 17, 2025
@c-fis

c-fis commented Apr 7, 2025

I ran into the same problem. The observed consequence is that the STM32 UART async callback is called with a UART_RX_RDY event holding a data length of e.g. 0xFFFFFFF4 (4294967284). Adding the irq_lock/irq_unlock fixes the problem.

The function uart_stm32_async_rx_timeout causes another problem. What is the reason for calling uart_stm32_async_rx_disable if dma_rx.counter equals dma_rx.buffer_length? This disables RX every time the received data chunk ends exactly at the buffer length, which can be observed if, for example, the peer device sends bytes one at a time with a small sleep in between. As a consequence of disabling RX, some bytes are lost.

I suggest removing the call to uart_stm32_async_rx_disable (or am I misunderstanding something in the code?):

static void uart_stm32_async_rx_timeout(struct k_work *work)
{
	// Added to stop the DMA transfer-complete interrupt from interrupting while this is running,
	// since they share data.
	unsigned int key = irq_lock();

	struct k_work_delayable *dwork = k_work_delayable_from_work(work);
	struct uart_dma_stream *rx_stream = CONTAINER_OF(dwork,
			struct uart_dma_stream, timeout_work);
	struct uart_stm32_data *data = CONTAINER_OF(rx_stream,
			struct uart_stm32_data, dma_rx);
	const struct device *dev = data->uart_dev;

	LOG_DBG("rx timeout");

	uart_stm32_dma_rx_flush(dev, STM32_ASYNC_STATUS_TIMEOUT);

	irq_unlock(key);
}

@tkem

tkem commented Apr 9, 2025

I can confirm that adding the irq_lock()/irq_unlock() calls and removing the uart_stm32_async_rx_disable(dev) solves some stability issues we have seen when using the Async UART API with the STM32H747I-DISCO.

We can now transfer up to 500,000 bytes without loss of data. Before applying both patches, this was not possible. Only applying irq_lock()/irq_unlock() improved the situation, but we were still losing data running a simple echo test.

Now I wonder what the call to uart_stm32_async_rx_disable(dev) was supposed to achieve, and whether there will be any side effects from removing it.

@erwango erwango added priority: medium Medium impact/importance bug and removed priority: low Low impact/importance bug labels Apr 9, 2025
@tkem

tkem commented Apr 16, 2025

What bothers me a little is that @HelligeChris IMHO very comprehensibly explained the race condition between uart_stm32_async_rx_timeout and dma_stm32_irq_handler and provided a very simple fix for that (thanks, BTW) about eight months ago, and the issue still exists. Is there anything we can do to speed things up a little, e.g. provide a PR or something?
As noted earlier, we also had to include the changes by @c-fis to make Async UART work reliably, so any comments on that would be more than welcome.

@djiatsaf-st
Collaborator

Hello everyone,

First of all, I understand your questions and apologize for the long delay in responding. This is due to a few reasons: bandwidth at work is quite limited, so there hasn't been enough time to confirm a proposed solution.

Since my last message, and up to your recent comments, I have found the solution to be functional but perhaps not optimal.

Tests on several board series haven't shown any regression, but it still raises concerns about what happens if another IRQ (from elsewhere) occurs during the blocked timeout phase. The IRQ lock should preferably be as short as possible to avoid blocking or slowing down the process.

That's why this issue has been dragging on for a while, and to move things forward, these points need to be confirmed.

Regarding uart_stm32_async_rx_disable(): from my understanding, the data is supposed to be stored in the buffer before the timeout. During this period of inactivity the device can be released, and a second buffer can be reattached if requested by the application. I ran some tests with this function removed and saw regressions, such as an RX_BUF_RELEASED timeout.
I'd tend to favor keeping this function.

This may not be very clear, but that's a bit of an overview of the problem from my perspective.

Thank you for understanding
