Frequent Exception 0 Crashes with v3.1.0 and v3.1.1, but none with Core <v3.0.2 #8830
Comments
Likely something to do with SoftwareSerial; 3.0.2 ships with 6.12.7, 3.1.x is 7.0.0 And, unless I am misreading the EXCVADDR, looks like |
Hmm... That's a good clue. Before getting the ESP Exception Decoder working in the Arduino IDE, I tried the esp8266_exception_decoder monitor filter with what I thought wasn't so helpful output, but I did notice it mentioned SoftwareSerial. I just didn't know what to make of it, but maybe it was more meaningful than I thought?
|
For the above, see https://docs.platformio.org/en/latest/projectconf/build_configurations.html |
This observation of yours really struck me and I remember seeing warnings when compiling the sketch from AirGradient. And I just noticed that I don't see those warnings when compiling the sketch recently with Core v3.0.2. Core v3.1.1
Core v3.0.2
Here is the code snippet from AirGradient.cpp. This is part of the overall set of code that reads the temperature (TMP) and relative humidity (RH) from the sensor. I'm guessing this piece defines the values that are returned if the sensor reading results in an error.
I'm a HW engineer, so forgive me if I'm a bit dense here. I don't know why the two core versions would treat this code differently. And I can't really understand the problem. Is a float or int not able to take on the And I'm not so sure this is related to the SoftwareSerial piece at all because the temperature/humidity sensor is on the i2c interface. This might be nothing. I'm working through the parts of code that use SoftwareSerial now and it's related to the PM2.5 and CO2 sensors. |
Warning is unrelated to the error, you just need to use literal Arduino IDE 1.x and 2.x disable compiler warnings by default. Since #8495, we work around that, but looks like at least one of warnings was missed. It is totally possible that the issue is not in SoftwareSerial, but we just happen to break its operation with something directly or indirectly related to how it is used in the code 🤷 |
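To illustrate the point about NULL, here is a minimal sketch. The struct and function names are hypothetical and are not the AirGradient code, and the choice of `0.0f` and `0` as the error values is an assumption. NULL is a null pointer constant, so assigning it to a float or int can draw a conversion warning under newer GCC, while a numeric literal states the same "zero on error" intent without one.

```cpp
#include <cassert>

// Hypothetical reading type; the field names are illustrative only.
struct SensorReading {
    float temperature;
    int humidity;
    bool ok;
};

// Builds the value reported when a sensor read fails.
// Writing `reading.temperature = NULL;` here is what tends to trigger the
// warning, because NULL is a null pointer constant rather than a number.
SensorReading makeErrorReading() {
    SensorReading reading{};
    reading.temperature = 0.0f;  // numeric literal instead of NULL (assumed value)
    reading.humidity = 0;        // numeric literal instead of NULL (assumed value)
    reading.ok = false;
    return reading;
}
```

The explicit `ok` flag is one way to let callers tell a failed read apart from a genuine zero reading.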
Thank you, @mcspr! I probably understood NULL was a pointer at some point during my undergrad studies over 20 years ago. Because you mentioned:
I decided to see if the PlatformIO decoder would give me any additional clues and so I set Any clue as to what direction I need to head down to figure out how building with |
Exception 0 is often ICACHE code that is not cached at the time of access and the SPI bus is busy. The processor reads zero which is also defined as an illegal instruction. Can be caused by a missing IRAM_ATTR on the ISR or functions that it calls.
|
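A minimal sketch of the IRAM_ATTR point above, assuming a simple pulse-counting interrupt. The function and variable names are hypothetical. The fallback `#define` is only there so the snippet also compiles off-target; on the ESP8266 core the macro comes from the core headers.

```cpp
#include <cassert>
#include <cstdint>

// On the ESP8266 Arduino core, IRAM_ATTR is provided by the core headers.
// This empty fallback is an assumption so the sketch also builds on a host.
#ifndef IRAM_ATTR
#define IRAM_ATTR
#endif

static volatile uint32_t pulseCount = 0;

// Helper called from the ISR. It carries IRAM_ATTR as well, because the
// compiler may decide not to inline it into the ISR.
static void IRAM_ATTR recordPulse() {
    pulseCount = pulseCount + 1;
}

// The ISR itself. On hardware it would be registered with attachInterrupt().
void IRAM_ATTR onPulse() {
    recordPulse();
}
```

The key detail is that every function reachable from the ISR needs the attribute, not only the ISR entry point.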
I know that the issue first appeared when moving to Core v3.1.x. It is also mysteriously-resolved when compiling with I can understand that it could crash if an ISR is missing the IRAM_ATTR attribute and an interrupt comes in causing the processor to access it from Flash when the SPI bus happens to be busy, but I don't see how that behavior could be different between the Core versions or the build_type flag? The inline bit is interesting too but again, it looks like it is inline. Your comment about "fails to inline" -- Is that a thing? Is that just a suggestion to the compiler and it's not guaranteed? Could that be something the new Core version or build flag could affect? At the time I captured those exceptions, I also captured the relevant section of atomic_base.h:
Because I just looked at Arduino and ESP (or any microcontroller) for the first time just a few days ago, have almost no software background, and none of the AirGradient code is mine, I've asked AirGradient for help and was told they are contacting their technical person to see if he can help. |
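On the question of whether inlining is guaranteed: in C++ the `inline` keyword is a request, and the compiler is free to emit an ordinary out-of-line function instead, which is why optimization settings can change the outcome. The sketch below contrasts a plain `inline` function with the GCC/Clang `always_inline` attribute. The function names and the `FORCE_INLINE` macro are hypothetical, not taken from the library.

```cpp
#include <cassert>

// Plain `inline` is only a hint; an out-of-line copy may still be emitted
// and called, depending on the optimization level.
inline int wrapIndex(int index, int size) {
    return index < 0 ? index + size : index;
}

// GCC/Clang extension that asks for inlining at every call site.
#if defined(__GNUC__)
#define FORCE_INLINE inline __attribute__((always_inline))
#else
#define FORCE_INLINE inline
#endif

FORCE_INLINE int wrapIndexForced(int index, int size) {
    return index < 0 ? index + size : index;
}
```

If a small helper like this is called from an ISR and is not inlined, its out-of-line copy lands wherever the linker places that section, which may be flash rather than IRAM.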
As I understand it
My only thought on this and the older version of the core working is code size changes and which code lands and stays in the cache changes. Some problems can be made worse by changing the cache size from 32K to 16K. |
Looking at https://github.com/esp8266/esp8266-wiki/wiki/Memory-Map, I see 0x4020_0000 is the beginning of SPI Flash mapped area. The D1 Mini has 4MB SPI Flash, so if my mental math serves me, that would be up to 0x403F_FFFF (22 bits). So, I see the exception occurs within that area. Is that indicative of anything? If the instructions of a function are indeed in IRAM, would the PC be pointing to there (0x4010_0000) instead? Or is this why you're saying that the crashing code is not in IRAM (the PC is pointing to Flash)? What about the case of a cached instruction? Would the PC in the exception dump still be pointing to the non-cached location?
But if that's the case, I would expect it to appear to be very temperamental. Like I could get code that works great with one version and not another, then grow the code a bit and then have it work horribly with the previously-good version and great with the previously-bad version, simply because of where things happen to land. But I'm not seeing that. With the several versions of code I've tested, it's always bad with the newer core versions (without the debug flag) -- to varying degrees (some code crashes much more frequently). And when we are talking about where functions land, are we talking about at compile/build-time or at runtime? I only have a crude understanding of processor architecture (and certainly nothing modern), so sorry if dumb question, but instruction caching sounds like it should be a runtime thing, right? And the inlining and IRAM would be determined at build-time? |
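The address reasoning above can be sketched as a small classifier. The two base addresses are the ones quoted in the thread; the region sizes are assumptions (32 KB of IRAM, and the 1 MB cacheable window suggested in the discussion). As plain arithmetic, 4 MB is 0x40_0000 bytes, so a full 4 MB span starting at 0x4020_0000 would end at 0x405F_FFFF; whether the hardware maps all of it at once is a separate question.

```cpp
#include <cassert>
#include <cstdint>

enum class Region { Iram, FlashMapped, Other };

// Base addresses as quoted in the thread; sizes are assumptions.
const uint32_t kIramBase = 0x40100000u;
const uint32_t kIramSize = 0x8000u;        // assumption: 32 KB of IRAM
const uint32_t kFlashBase = 0x40200000u;
const uint32_t kFlashWindow = 0x100000u;   // assumption: 1 MB cacheable window

// Reports which region a program counter value falls into.
inline Region classify(uint32_t pc) {
    if (pc >= kIramBase && pc - kIramBase < kIramSize) return Region::Iram;
    if (pc >= kFlashBase && pc - kFlashBase < kFlashWindow) return Region::FlashMapped;
    return Region::Other;
}
```

Under these assumptions, an address such as 0x4021559c (reported later in the thread) classifies as flash-mapped, which matches the reading that the crashing code is not in IRAM.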
I believe the cacheable address space is limited to 1MB for the ESP8266
I was poorly referring to how the contents of the cache are always changing. The way the compiler and linker organize the code could change with small code changes. The timing of operations could alter the current cache contents between interrupts. I cannot make out from the dump who is calling I have used this macro in leaf functions to force the compiler to save the return address on the stack. It contains no assembly; it lies to the compiler, telling it that we are going to clobber register "a0" so that the compiler will save it.
#define DEBUG_ESP_BACKTRACELOG_LEAF_FUNCTION(...) __asm__ __volatile__("" ::: "a0", "memory")
You would place the macro call at the beginning of the function to be sure register "a0" is saved as soon as possible. And the build define |
Just as a reminder, we also could compare built .elf files
And... nothing else, apparently. queue's edit: Speaking of debug mode and build flags - IDE uses |
Aaand we also may run into a different code when using PlatformIO debug; it also replaces optimization mode with
# build_type = debug
'CCFLAGS': [ '-mlongcalls',
'-mtext-section-literals',
'-falign-functions=4',
'-U__STRICT_ANSI__',
'-ffunction-sections',
'-fdata-sections',
'-Wall',
'-Werror=return-type',
'-free',
'-fipa-pta',
'-Og',
'-g2',
'-ggdb2'],
# `build_flags = -g`
'CCFLAGS': [ '-g',
'-Os',
'-mlongcalls',
'-mtext-section-literals',
'-falign-functions=4',
'-U__STRICT_ANSI__',
'-ffunction-sections',
'-fdata-sections',
'-Wall',
'-Werror=return-type',
'-free',
'-fipa-pta'], |
Thanks @mhightower83 & @mcspr ! I might need a bit of time to fully digest what you two are telling me.
One question out of curiosity: Are three bytes of zeros the ONLY way to hit an Exception 0? Wouldn't any/all invalid opcode do the same?
I understand the idea behind this to make the stack more informative when it dumps. I may have to try this, but I will probably need some hand-holding when the time comes. But I don't know the full mechanics of how this works. This looks like a parameterized macro substitution, but what source file does this go in? Somewhere in the SoftwareSerial code? I don't see
Does this mean that comparing the ELF files would be able to tell us what is calling
So, I think this means that if the working theory that an ISR is calling
What do you mean by "IDE"? The Arduino IDE? And on PlatformIO, I've only tried Update on my end: Shortly after my last post here, I decided to try using Core v3.1.1 and simply replace the SoftwareSerial v7.0.0 with 6.12.7. I did this in my Arduino IDE's library, recompiled, and the board is now running for nearly 5 hours with no exceptions. If this proves to be stable, then I will revert to 7.0.0 to confirm that it crashes. Hopefully, it does. Then, I can narrow down the exact release that triggers the exceptions. Now I'm wondering if I can get the two sets of code as close as possible to each other, would that make your ELF comparison idea much easier to pinpoint? |
If the problem is related to inlining / not inlining, then yes.
No IRAM attribute So, suppose circular_queue::available() is not inlined, should we just match it here and force IRAM section? Arduino/tools/sdk/ld/eagle.app.v6.common.ld.h Lines 143 to 145 in e25f9e9
diff --git a/tools/sdk/ld/eagle.app.v6.common.ld.h b/tools/sdk/ld/eagle.app.v6.common.ld.h
index 051ce170..dd0a4cdd 100644
--- a/tools/sdk/ld/eagle.app.v6.common.ld.h
+++ b/tools/sdk/ld/eagle.app.v6.common.ld.h
@@ -142,6 +142,8 @@ SECTIONS
/* all functional callers are placed in IRAM (including SPI/IRQ callbacks/etc) here */
*(.text._ZNKSt8functionIF*EE*) /* std::function<any(...)>::operator()() const */
+ *(.text._ZN14circular_queue*) /* SoftwareSerial ISR */
+ *(.text._ZNK14circular_queue*) /* SoftwareSerial ISR */
} >iram1_0_seg :iram1_0_phdr
.irom0.text : ALIGN(4) |
I would like to think so. I'll say yes based on page 84 of "Xtensa® Instruction Set Architecture (ISA) Reference Manual"
Also, there appears to be a narrow (two-byte) version of the
Sorry for the confusion, that is the form in which I use it. It makes it easy to turn off and on for debug builds.
size_t available() const
{
__asm__ __volatile__("" ::: "a0", "memory");
int avail = static_cast<int>(m_inPos.load() - m_outPos.load());
if (avail < 0) avail += m_bufSize;
return avail;
}
On entry to the function, register |
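The wraparound arithmetic in `available()` can be checked on a host with a small stand-in. This is a sketch only: `RingIndices` is a hypothetical struct that mirrors the member names from the snippet above, not the library's `circular_queue` class, and it leaves out the `a0` clobber, which is specific to Xtensa.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>

// Hypothetical stand-in for the queue's index bookkeeping; not the library class.
struct RingIndices {
    std::atomic<std::size_t> m_inPos{0};
    std::atomic<std::size_t> m_outPos{0};
    std::size_t m_bufSize;

    explicit RingIndices(std::size_t size) : m_bufSize(size) { assert(size > 0); }

    // Same arithmetic as the snippet above: a negative difference means the
    // write index has wrapped past the end of the buffer, so add the size back.
    std::size_t available() const {
        int avail = static_cast<int>(m_inPos.load() - m_outPos.load());
        if (avail < 0) avail += static_cast<int>(m_bufSize);
        return static_cast<std::size_t>(avail);
    }
};
```

Each `load()` here is a call into the atomic header, which may explain why the decoded PC pointed at atomic_base.h when the surrounding function was inlined differently.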
see esp8266#8830 probably would be annoying to keep in sync reliable regression test would be nice in some form or the other (https://sourceware.org/gdb/current/onlinedocs/gdb#Python-API ?)
I see you're adding My testing using Core v3.1.1 with SoftwareSerial v6.12.7 completed 14h of uptime. I then reverted back to SoftwareSerial v7.0.0 and confirmed multiple Exception 0 crashes within 5 minutes. Finally, re-compiled with SoftwareSerial v6.17.1 (version immediately prior to current release) and confirmed running fine for over 4 hours. And finally, for my own sanity, I downloaded v7.0.0 source directly from the SoftwareSerial repository, compared to what I had pulled out to ensure they are the same (they were) and then re-built once again with v7.0.0 to re-verify that it indeed crashes every few minutes with the same Exception. Core v3.1.1 + SoftwareSerial v7.0.0 => Exception 0 Core v3.0.2 + SoftwareSerial v7.0.0 => Exception 0 Core v3.1.1 => Exception 0 Core v3.1.1 => Exception 0 I zipped up the entire build folder from Arduino IDE's Temp folder. The ELF and other stuff are all in there. arduino_build_822815_Core3.1.1_SoftwareSerial7.0.0_EXCEPTION0.zip
Yes, but I read this and wasn't sure: Table 4–59. Instruction Exceptions under the Exception Option
What could "a legal instruction under illegal conditions" be? Does this suggest that Exception 0 could be encountered somehow besides ILL or ILL.N?
Ahh... This is great. I re-installed core v3.1.1, replaced the code with yours, re-compiled, and ran. As usual, it crashes with exception 0 every few minutes the same way as before. Here are some stack dumps. One thing that I don't understand, but I guess it makes sense: The PC is now decoded to circular_queue.h line 119 (before, it was atomic_base.h line 420). Here's the snippet from circular_queue.h:
I also zipped the build folder of this one as well. ELF file inside, if that would be helpful. |
Great verification.
This is also needed. Was it included in this build? This optimization hinders seeing the call chain in the stack dump. When a function is going to return after a call, the compiler replaces the call with a jump instruction. After a crash, there are no tracks in the stack leading into the crash. |
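The shape of code that the sibling-call optimization affects can be sketched as follows. The function names are hypothetical. The call in `checksumEntry` is the last thing that function does, so the compiler may replace "call, then return" with a plain jump, and `checksumEntry` then leaves no return address on the stack for a crash dump to show.

```cpp
#include <cassert>
#include <cstdint>

// Callee: with sibling-call optimization this can be entered by a jump
// instead of a call.
uint32_t checksumStep(uint32_t acc, uint32_t value) {
    return (acc << 1) ^ value;
}

// The call below is in tail position. When it is turned into a jump, a crash
// inside checksumStep() shows no trace of checksumEntry() in the stack dump.
uint32_t checksumEntry(uint32_t value) {
    return checksumStep(0x55u, value);
}
```

Building with `-fno-optimize-sibling-calls`, as discussed in this thread, keeps the real call so the caller stays visible in the dump; the result of the code is the same either way.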
I made a similar observation on my weather station.
in the MAP File i found the address 4021559c related to circular_queue object |
Oh, no, I totally overlooked that part! I couldn't find any place to add build options in ArduinoIDE, so I'm looking at doing this in PlatformIO now. Is it under Build Options ==> |
Hmm.. Now that I think about it, I can't do this in PlatformIO because I can't get the Exception Decoder to work there unless I use the |
Right, I forgot about that.
/*@create-file:build.opt@
-fno-optimize-sibling-calls
*/
However, for this to work today on Windows, you will need the updated ./tools/mkbuildoptglobals.py. This should appear later in the 3.1.2 update. |
I'm so glad for this since I was halfway down a rabbit hole reading about having to be cautious with build flags because there is no dedicated spot for the users to put in flags and using existing options risks overwriting flags defined by platform developers. This was no problem at all since I already replaced I created the I turned on compiler verbosity to make sure the flag got incorporated. I think it did:
Here are the first exception decodes:
|
Great! And the ISR caller
|
As an implementer of sibling call optimization in xtensa, I'd like to step in off-topic. |
@jjsuwa-sys3175 I think that is a good suggestion. I too have been thinking it should be added to the docs. I'll take this as a nudge to get it done. |
Just in case, this is from the exact file on my system:
I saw the issue created in the espsoftwareserial repository earlier today. Thank you, @mhightower83 and @mcspr for your guidance and patience. I'm completely out of my element here, but I learned a lot! There are a bunch of us that are experiencing this issue so it will help a lot of people in the community when resolved. I'm looking forward to testing whatever solution winds up being implemented. |
I see we're reverting to SoftwareSerial v6, but does this mean the breaking change in v7 cannot be made to work? Does this affect other Arduino libraries as well like ESP32? |
ESP32 does not ship SoftwareSerial; iirc it is up to the user to keep it updated. v7 has reworked SoftwareSerial::onReceive in a way that changed the execution context; there is no longer ::perform_work(), and the callback is executed in the SoftwareSerial ISR directly. This is tangentially related; so far the decision is to revert. |
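The difference in execution context described above can be modelled with a small sketch. `DeferredReceiver` and its methods are hypothetical and only illustrate the deferred style, where the interrupt records that data arrived and the user callback runs later from the main loop; this is not the SoftwareSerial API.

```cpp
#include <cassert>
#include <functional>
#include <utility>

// Hypothetical receiver that models deferred dispatch; not the library API.
class DeferredReceiver {
public:
    explicit DeferredReceiver(std::function<void()> cb) : callback_(std::move(cb)) {}

    // Would run in interrupt context: it only records that data arrived.
    void onInterrupt() { pending_ = true; }

    // Runs from the main loop, so the user callback executes outside the ISR.
    // Returns true if the callback was dispatched.
    bool performWork() {
        if (!pending_) return false;
        pending_ = false;
        if (callback_) callback_();
        return true;
    }

private:
    std::function<void()> callback_;
    volatile bool pending_ = false;
};
```

With direct dispatch, the callback and everything it calls would instead run inside the ISR, which is where the IRAM placement concerns raised in this thread come into play.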
Prevent system crash with Exception (0) due to software serial bug in the latest core version. esp8266/Arduino#8830
Basic Infos
Platform
Settings in IDE
Problem Description
I (and several others) have an AirGradient Pro kit, which runs on a Wemos D1 Mini. I am seeing repeated Exception 0 crashes every few minutes when compiled with Core v3.1.0 and v3.1.1, but no Exceptions at all (tested over 24 hours) with v3.0.2 and below (I tested down to v3.0.0). This behavior is consistent with the latest Arduino IDE 2.0.3, Arduino IDE 1.8.19, and VSCode+PlatformIO with platform-espressif8266 v4.1.0 (which is updated to Arduino Core v3.1).
Because it crashes every few minutes, I captured a lot of exceptions, but I had no way of decoding them in Arduino IDE 2.x, and IDE 1.8.x didn't work with the Arduino Core v3.1.x because of Issue #8811. I finally worked around the issue and got the v3.1.x core to work with the older Arduino IDE v1.8.x so that I could run the ESP Exception Decoder. I think this is what is needed?
It crashes so often, I have decoded dozens of exceptions and each one is always the same:
circular_queue ::available() const at c:\users\ken\appdata\local\arduino15\packages\esp8266\tools\xtensa-lx106-elf-gcc\3.1.0-gcc10.3-e5f9fec\xtensa-lx106-elf\include\c++\10.3.0\bits/atomic_base.h line 420
I'm not very experienced, but I have spent hours isolating it to the Core v3.1.x update and getting the decoder working. Apologies if I'm not reporting this correctly or if there is some debug procedure I should do that is not obvious to me. Thanks.
Debug Messages