Skip to content

Strange std::begin(x) and std::end(x) with c-strings #8314

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
6 tasks done
mcspr opened this issue Sep 17, 2021 · 6 comments
Open
6 tasks done

Strange std::begin(x) and std::end(x) with c-strings #8314

mcspr opened this issue Sep 17, 2021 · 6 comments

Comments

@mcspr
Copy link
Collaborator

mcspr commented Sep 17, 2021

Basic Infos

  • This issue complies with the issue POLICY doc.
  • I have read the documentation at readthedocs and the issue is not addressed there.
  • I have tested that the issue is present in current master branch (aka latest git).
  • I have searched the issue tracker for a similar issue.
  • If there is a stack dump, I have decoded it.
  • I have filled out all fields below.

Platform

  • Hardware: ESP-12
  • Core Version: 612e7ff
  • Development Env: PlatformIO
  • Operating System: Windows

Settings in IDE

Defaults

Problem Description

Using std::begin and std::end or the equivalent &buf[0] and &buf[Size], sometimes the resulting string pointer happens to be at a different location than expected. Using addresses(std::begin(...), std::end(...)); outside of the template produces the same results. Referring to &buf[Size-1] aka '\0' does not result in the incorrect pointers though.

Originally noted in the earlephilhower/newlib-xtensa#19 (comment), but it might be something different than string suffix merging / anything related to the toolchain or libc? (and not to continue an already long issue thread. plus, I hope I have not broken toolchain installation somehow)

MCVE Sketch

#ifdef NATIVE
#include <cstdint>
#include <cstdio>
#include <iterator>
#else
#include <Arduino.h>
#endif

void addresses(const char* const begin, const char* const end) {
    ::printf("%p:%p -> (%u)\n", begin, end, end - begin);
}

template <size_t Size>
void addresses(const char (&buf)[Size]) {
    addresses(std::begin(buf), std::end(buf));
}

void test() {
    addresses("");
    addresses(",");
}

#ifndef NATIVE
void setup() {
    Serial.begin(115200);
    delay(1000);
    ::puts("\n\n\n");
    test();
}

void loop() {
    delay(100);
}
#else
int main() {
    ::puts("\n\n\n");
    test();
}
#endif

Debug Messages

0x3ffe87e0:0x3ffe87e1 -> (1)
0x3ffe87e1:0x3ffe87dd -> (4294967292)

Which is not the expected result. Inspecting the binary, "," is actually at the 0x3ffe87db as one would expect from the end (one past the last element of array of 2 elems)

@TD-er
Copy link
Contributor

TD-er commented Sep 17, 2021

Just as a test, I tried it on C++ shell, which does give the expected results.
What is striking is that the reported start address of the 2nd C-string is at the "one past the last element" address of the first string.
I did a few tests on the C++ shell and even when I swap the empty and the 1 char string, they keep the order in which they are defined in the code.
So I wonder why in your run the latter string is placed before. (based on your comment and end address)

Is this run from a completely clean build?

Edit:
I can also get the order of the strings to be swapped in memory by toggling the optimization flag (-O1 or higher) on C++shell.
So apparently that is expected behavior.
This means we're back at what you expressed as a hypothesis a while ago that we might be looking at some old stored address when linking.

@mcspr
Copy link
Collaborator Author

mcspr commented Sep 17, 2021

Could not reproduce with 10.2 & 11.1 x86_64. Try on current xtensa-toolchain gcc locally and run / disassemble test in gdb?
And it seems only string literals are affected. Can't really trigger the issue when using either const char buf[] = "..."; or static const char buf[] = "..."; are passed to the function.

earlephilhower added a commit to earlephilhower/Arduino that referenced this issue Nov 27, 2021
Fixes esp8266#8314 and other hard-to-track bugs.

GCC 10.3 had an issue with addressing constant literals which would result
in crazy offsets being used and random crashes in production.  Update
with an upstream GCC 11 bugfix.
@mcspr
Copy link
Collaborator Author

mcspr commented Nov 30, 2021

Small update based on the gitter discussion and a compressed example. Using and not using -fno-constants-merge as a build flag has different results for the test case.

For the function body such as this

35      void test() {
36          addresses(EMPTY_STRING);
   0x40201048 <+0>:     l32r    a3, 0x4020103c
   0x4020104b <+3>:     l32r    a2, 0x40201040
   0x4020104e <+6>:     addi    a1, a1, -16
   0x40201051 <+9>:     s32i    a0, a1, 12
   0x40201054 <+12>:    call0   0x40201020 <addresses(char const*, char const*)>

37          addresses(COMMA_STRING);
   0x40201057 <+15>:    l32r    a3, 0x40201044
   0x4020105a <+18>:    l32r    a2, 0x4020103c
   0x4020105d <+21>:    call0   0x40201020 <addresses(char const*, char const*)>
   0x40201060 <+24>:    l32i    a0, a1, 12
   0x40201063 <+27>:    addi    a1, a1, 16
   0x40201066 <+30>:    ret.n

The l32r address loads order & values do not change, string addresses inside of them do

0x4020103c:     0x3ffe8633
0x40201040:     0x3ffe8632
0x40201044:     0x3ffe8635

(and the 3ffe86* seems to be located in core_esp8266_phy.cpp.o for some reason)

This is by default, without the switch

0x4020103c:     0x3ffe87e1
0x40201040:     0x3ffe87e0
0x40201044:     0x3ffe87dd

(strings are other way around)

@mcspr
Copy link
Collaborator Author

mcspr commented Dec 1, 2021

A step before linking, assembly has slight differences. Without the flag, strings have .rodata.$origin$.str#.# section (name originates here), while with the flag they are assigned a generic .rodata

diff --git a/a-defaults.s b/a-fno-constants-merge.s
index 8ada173..c8fb2c2 100644
--- a/a-defaults.s
+++ b/a-fno-constants-merge.s
@@ -1,6 +1,6 @@
        .file   "a.cpp"
        .text
-       .section        .rodata._Z9addressesPKcS0_.str1.1,"aMS",@progbits,1
+       .section        .rodata
 .LC0:
        .string "%p:%p -> (%u)\n"
        .section        .text._Z9addressesPKcS0_,"ax",@progbits
@@ -21,7 +21,7 @@ _Z9addressesPKcS0_:
        addi    sp, sp, 16
        ret.n
        .size   _Z9addressesPKcS0_, .-_Z9addressesPKcS0_
-       .section        .rodata._Z4testv.str1.1,"aMS",@progbits,1
+       .section        .rodata
 .LC2:
        .string ""
 .LC5:
@@ -48,7 +48,7 @@ _Z4testv:
        addi    sp, sp, 16
        ret.n
        .size   _Z4testv, .-_Z4testv
-       .section        .rodata.setup.str1.1,"aMS",@progbits,1
+       .section        .rodata
 .LC9:
        .string "\n\n\n"
        .section        .text.setup,"ax",@progbits

@mcspr
Copy link
Collaborator Author

mcspr commented Dec 1, 2021

Another test, but now with some custom section that does not start with .rodata, but is set through code to be 'mergeable' (notice that both strings must be in the same section or match into some section by the .ld script. here, nothing matches '.mystring' so it's just dumped into the bss b/c of the *(COMMON) matcher there). Both ld and gcc are with the default build flags, both relaxation and gcc merging are enabled.

34      #define MERGE_SECTION(NAME) __attribute__((section( "\"" NAME "\", \"aSM\", @progbits, 1 #")))
35      void test() {
36          static const char EMPTY_STRING[] MERGE_SECTION(".mystrings") = "";
37          addresses(EMPTY_STRING);
   0x40201048 <+0>:     l32r    a3, 0x4020103c
   0x4020104b <+3>:     l32r    a2, 0x40201040
   0x4020104e <+6>:     addi    a1, a1, -16
   0x40201051 <+9>:     s32i    a0, a1, 12
   0x40201054 <+12>:    call0   0x40201020 <addresses(char const*, char const*)>

38          static const char COMMA_STRING[] MERGE_SECTION(".mystrings") = ",";
39          addresses(COMMA_STRING);
   0x40201057 <+15>:    l32r    a3, 0x40201040
   0x4020105a <+18>:    l32r    a2, 0x40201044
   0x4020105d <+21>:    call0   0x40201020 <addresses(char const*, char const*)>
   0x40201060 <+24>:    l32i    a0, a1, 12
   0x40201063 <+27>:    addi    a1, a1, 16
   0x40201066 <+30>:    ret.n
> xtensa-lx106-elf-objdump -s -j .mystrings .pio/build/d1_mini/firmware.elf
.pio/build/d1_mini/firmware.elf:     file format elf32-xtensa-le

Contents of section .mystrings:
 3ffe8998 2c00                                 ,.

And it does match the expected layout.

0x4020103c:     0x3ffe899a << end for both "," and "" (merged)
0x40201040:     0x3ffe8999 << '\0', start for ""
0x40201044:     0x3ffe8998 << ',', start for ","

edit: probably was to hasty to blame .rodata, only pattern so far is placement in the section and in what order they are processed by the linker? for example, this also breaks when COMMA_STRING and EMPTY_STRING are swapped in the code. Here, ',' is used both as start of the "," and as the end for empty string. Pointers to the start of the "" and end of the "," are saved

static const char COMMA_STRING[] MERGE_SECTION(".rodata.bleh") = ",";
static const char EMPTY_STRING[] MERGE_SECTION(".rodata.bleh") = "";
> bash dump.sh
0x4020103c:     0x3ffe87df
0x40201040:     0x3ffe87de
0x40201044:     0x3ffe87e1
> xtensa-lx106-elf-objdump -s -j .rodata .pio/build/d1_mini/firmware.elf | grep -E '3ffe87[d,e]'
 3ffe87d0 70202d3e 20282575 290a000a 0a0a002c  p -> (%u)......,
 3ffe87e0 00281a14 00000000 00000000 00000000  .(..............
> xtensa-lx106-elf-objdump -s -j .rodata.bleh .pio/build/d1_mini/src/a.cpp.o

.pio/build/d1_mini/src/a.cpp.o:     file format elf32-xtensa-le

Contents of section .rodata.bleh:
 0000 002c00                               .,.

When EMPTY_STRING is first, COMMA_STRING second in the code

> bash dump.sh
0x4020103c:     0x3ffe87df
0x40201040:     0x3ffe87de
0x40201044:     0x3ffe87df
> xtensa-lx106-elf-objdump -s -j .rodata .pio/build/d1_mini/firmware.elf | grep -E '3ffe87[d,e]'
 3ffe87d0 70202d3e 20282575 290a000a 0a0a002c  p -> (%u)......,
 3ffe87e0 00281a14 00000000 00000000 00000000  .(..............
> xtensa-lx106-elf-objdump -s -j .rodata.bleh .pio/build/d1_mini/src/a.cpp.o

.pio/build/d1_mini/src/a.cpp.o:     file format elf32-xtensa-le

Contents of section .rodata.bleh:
 0000 2c0000                               ,..

@mcspr
Copy link
Collaborator Author

mcspr commented Dec 2, 2021

(Unsurprisingly?), can also reproduce on the current arduino-esp32 2.0.1, just a matter of making strings appear in a certain order. But, can't reproduce just with the xtensa toolchain or godbolt to minimize this even more :/

This is arduino-cli.exe compile --clean -b esp32:esp32:esp32 aka basic esp32 dev board

Dump of assembler code for function test():
testme.ino:
11      void test() {
   0x400d0f40 <+0>:     entry   a1, 32

12          addresses("");
   0x400d0f43 <+3>:     l32r    a11, 0x400d0024 <_stext+4>
   0x400d0f46 <+6>:     l32r    a10, 0x400d0028 <_stext+8>
   0x400d0f49 <+9>:     call8   0x400d0f2c <addresses(char const*, char const*)>

13          addresses(",");
   0x400d0f4c <+12>:    l32r    a11, 0x400d002c <_stext+12>
   0x400d0f4f <+15>:    l32r    a10, 0x400d0024 <_stext+4>
   0x400d0f52 <+18>:    call8   0x400d0f2c <addresses(char const*, char const*)>
   0x400d0f55 <+21>:    retw.n
End of assembler dump.
0x400d0024 <_stext+4>:  0x3f40264d
0x400d0028 <_stext+8>:  0x3f40264c
0x400d002c <_stext+12>: 0x3f400131
> xtensa-esp32-elf-objdump -s -j .flash.rodata .\testme.ino.elf | select-string 3f40264
 3f402640 69746174 696f6e20 4572726f 723a2043  itation Error: C
> xtensa-esp32-elf-objdump -s -j .flash.rodata .\testme.ino.elf | select-string '3f4001[2,3]'
 3f400120 25703a25 70202d3e 20282575 290a002c  %p:%p -> (%u)..,
 3f400130 002a0000 00000000 00000000 20b20e40  .*.......... ..@

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants