drivers: serial: stm32: Data loss & possible hardfault #76799
Comments
@djiatsaf-st:
Hi @HelligeChris, I looked into this on my side and have a few points that I hope are useful. First, I checked whether a hardfault can occur when the offset reaches the max in combination with memcpy(). I implemented a test reproducing that situation and then checked the CFSR (Configurable Fault Status Register): its value was 0x00, meaning no hardfault of any type occurred. You can also use the fault analyzer in STM32CubeIDE; the result will be the same. However, rather than using memcpy(), you can implement your own copy function, and there you will see results. I used this function instead of memcpy(), for example:
(You can modify it to suit your needs if necessary.) As for the data loss and the race condition, I could not fully reproduce your situation; I would need a little more information on the configuration and the use case that pushed you to raise this issue. I also tested your patch: I applied your changes blocking interrupts and then checked whether data arrived in the order you expect, and so far I did not observe the behavior you describe; communication was rather "normal", as usual. From my analysis, the Zephyr asynchronous serial API is designed this way (it does not appear to be a straight FIFO buffer), so a mechanism or function that properly processes the data from the buffer would be needed. Zephyr also implements a mechanism to continue receiving data when a buffer fills up (event->data.rx.offset reaching its max); that could be a possible solution to the problem too: the driver issues a request (uart event type = UART_RX_BUF_REQUEST) for a second rx buffer. Here is the link: we would have a configuration a bit like this:
The global array rx_second_buffer then holds the data history up to the defined size.
Hi @djiatsaf-st, The core issue is the race condition between the timeout (uart_stm32_async_rx_timeout) and the actual DMA interrupt (dma_stm32_irq_handler in dma_stm32.c, which calls uart_stm32_dma_rx_cb in uart_stm32.c). Since both callbacks make changes to the variable data->dma_rx.counter, depending on when (and whether) the DMA interrupt preempts the timeout, data->rx.len can go "negative"; since its type is size_t, it instead wraps around to 4,294,967,295. This is what we've seen lead to hardfaults.
By max do you mean the size of the buffer or the size_t max size?
In our application we have two different controllers with serial communication between them, where one acts as a sensor while the other periodically and asynchronously gets the newest sensor value. The UART rx is enabled like this:
So we expect the driver to time out after 1 ms with no new data received. The problem occurs when the timeout is triggered while the DMA rx buffer is almost full: while the timeout callback is running, more bytes are received and the DMA rx buffer fills completely, triggering the DMA interrupt. Since the DMA interrupt is an ISR, the timeout callback is "paused" and the DMA ISR runs; after it finishes, the timeout callback continues where it left off. This leads to data being received out of order/not in sequence.
Just to be clear, by "normal" do you mean data is received sequentially? Since this is also the operation I expect after applying the patch.
I believe we already use this mechanism in our use case, here's our complete callback:
Thanks for your answers.
I meant the buffer size.
No, "normal" does not mean the data is received sequentially; that's why I put it in quotes. I meant that ever since I've used UART communication on Zephyr, I have always had to make some modifications to the buffer data to get the right values. For the rest, I've taken note and understand the use case and the issue well. I need to run some tests on my side and check that the patch does not impact the current state of the driver. If I have other questions I will come back to you.
This issue has been marked as stale because it has been open (more than) 60 days with no activity. Remove the stale label or add a comment saying that you would like to have the label removed otherwise this issue will automatically be closed in 14 days. Note, that you can always re-open a closed issue at any time.
Please do not close this issue; it has not been fixed in the newest Zephyr. The issue is real and causes data loss/mangling when using the STM32 UART in DMA mode. Until it is fixed we will have to manually re-apply the fix to our production code for each Zephyr update. The fix is literally a two-line patch of:
I ran into the same problem. The observed consequence is that the STM32 UART async callback is called with a UART_RX_RDY event holding a data length of e.g. 0xFFFFFFF4/4294967284. Adding the irq_lock/irq_unlock fixes the problem.
The function uart_stm32_async_rx_timeout causes another problem. What is the reason for calling uart_stm32_async_rx_disable when dma_rx.counter equals dma_rx.buffer_length? This disables RX every time a received data chunk ends exactly at the buffer length, which can be observed if, e.g., bytes are sent one by one with a small sleep in between from the peer device. As a consequence of disabling RX, some bytes are lost. I suggest removing the call to uart_stm32_async_rx_disable (or am I misunderstanding something in the code?).
I can confirm that adding the
We can now transfer up to 500,000 bytes without loss of data. Before applying both patches, this was not possible. Only applying
Now I wonder what was the call to
What bothers me a little is that @HelligeChris IMHO very comprehensibly explained the race condition between uart_stm32_async_rx_timeout and the DMA transfer-complete interrupt.
Hello everyone,
First of all, I understand your questions and apologize for the long delay in responding. Bandwidth at work is quite limited, so there hasn't been enough time to confirm the proposed solution. Since last time, I have found the solution to be functional but perhaps not optimal: tests on several board series haven't shown any regression, but the concern remains about what happens if another IRQ (from elsewhere) occurs during the blocked timeout phase. An IRQ lock should be held as briefly as possible to avoid blocking or slowing down other processing. That's why this issue has been dragging on; to move things forward, these points need to be confirmed.
Regarding uart_stm32_async_rx_disable(): from my understanding, the data is supposed to be stored in the buffer before the timeout. During this period of inactivity the device can be released, and a second buffer can be reattached if requested by the application. I ran some tests removing this function, and there were regressions such as an RX_BUF_RELEASED timeout.
This may not be very clear, but that's an overview of the problem from my perspective. Thank you for understanding.
Describe the bug
The async_user_callback can be triggered from two places: from the DMA transfer-complete interrupt generated by the DMA, and from a k_work queue timeout.
The timeout does not run in ISR context, which means the DMA transfer-complete event can interrupt it, resulting in data delivered out of sequence and in the DMA buffer offset and length being changed while the timeout is using these variables.
This can be seen in the call stack; note the timeout is mid-copy (memcpy) when the interrupt triggers:
To Reproduce
We've used this callback handler to debug the issue, simply by setting a breakpoint at LOG_ERR("UART FAULT"); the variables bad_message_last_buffer_ptr and expected_new_offset need to be stored outside the callback.
From what I can see this issue is a race condition: for example, if the timeout triggers when the DMA buffer is almost full (missing one byte or so) and the last bytes are received, the DMA transfer-complete interrupt will trigger and interrupt the timeout.
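The callback itself is not shown above; the following is a hedged reconstruction of the kind of continuity check described, using the variable names mentioned (bad_message_last_buffer_ptr, expected_new_offset) but otherwise hypothetical logic:

```c
/* Hedged reconstruction (not the reporter's exact callback): each
 * UART_RX_RDY chunk must start where the previous one ended, otherwise
 * flag a fault -- the condition the LOG_ERR breakpoint would catch. */
#include <stddef.h>
#include <stdio.h>

/* Stored outside the callback, as the report notes. */
static const void *bad_message_last_buffer_ptr;
static size_t expected_new_offset;

static int check_rx_chunk(const void *buf, size_t offset, size_t len)
{
    /* A new (rotated) buffer restarts the expectation at offset 0. */
    if (buf != bad_message_last_buffer_ptr) {
        bad_message_last_buffer_ptr = buf;
        expected_new_offset = 0;
    }
    if (offset != expected_new_offset) {
        printf("UART FAULT\n"); /* breakpoint target in the report */
        return -1;
    }
    expected_new_offset = offset + len;
    return 0;
}
```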
Expected behavior
The timeout cannot be interrupted by DMA transfer complete.
Impact
For now we've had to patch the uart_stm32.c driver with an irq_lock/irq_unlock at the start and end of the uart_stm32_async_rx_timeout function:
However, without this the driver is effectively broken, i.e. it can lead to hard faults (we've seen event->data.rx.offset be int max, which when used with a memcpy causes a hardfault) or simply lost data.
Environment (please complete the following information):
Additional context
This was tested on the STM32H743