Skip to content

Stack size when using Core1 #1161

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
lu7did opened this issue Feb 7, 2023 · 16 comments
Closed

Stack size when using Core1 #1161

lu7did opened this issue Feb 7, 2023 · 16 comments

Comments

@lu7did
Copy link

lu7did commented Feb 7, 2023

Hello,

On a development I'm engaged the functions splits between signal processing with it's associated math, sampling the signal to be processed and the GUI to handle the operator and show results (plus some other miscellanea), all running ok in Core0.
As the signal processing experiences occasionally a time shortage to perform it's scope within a window as required I though to split the workload involving Core1. So the pure math, almost no external libraries duty on Core1 and all the rest on Core0.
I've experienced in the past quite a lot of issues when trying to engage Core1 with other processings but always the overal direction of the analysis pointed to some resources clash or library issue.
A long story short when I perform the split the processing at Core1 stops working after few seconds, different re-arrangements of the code didn't change the behaviour. Then, just stabbing in the dark, I guessed that because of the nature of the activity performed (lot's of calls to math functions with large argument lists and areas being manipulated with quite a few of local
variables) might require a substantial stack to be handled. At the same time all examples of multicore are rather trivial and undemanding for the core1 processing, just focusing on the exchange of information with core0 and the simultaneous operation.

I find no way to manipulate the stack when starting Core1 with setup1()/loop1() so I started the core1 processing using
multicore_launch_core1_with_stack ( void(*entry)(void), uint32_t *stack_bottom, size_t stack_size_bytes )
and gradually increase the stack_size up to a point where ¡voila! everything started to work without changing anything
else when I assigned a stack size above 10000 Bytes. Perfect, stable, operating correctly and without any further issue.

So my question is. ¿What is the stack size assigned by default when the core1 is started using the core functions? and furthermore ¿is there a way to obtain it thru some function or macro primitive?, obviously I'm interested to understand if there is a way to set initially or change it as well.

Thanks in advance, Pedro.

@earlephilhower
Copy link
Owner

The stacks are the default SDK ones: 4K each if you use core 0 and 1, or 8K if you only run core 0.

There is no way of changing it thru the core or IDE (it's defined to sit in the special, high-perf stack region on chip). You could adjust the linker file to change core0/1 stacks, use the SDK's core1 startup routine like you did, or even create a stack on the heap and thunk into is using the same logic as in the SSL code https://github.com/earlephilhower/arduino-pico/blob/master/libraries/WiFi/src/StackThunk.cpp

Alternatively, you could look at FreeRTOS which lets you set per-task stacks as part of the task creation call.

Good luck!

@XZCE
Copy link

XZCE commented Jun 19, 2023

I know this is a closed issue, but...

My use-case is slightly different, the dummy area used for the stack (on both cores) is defined as 2KB (half the space available from scratch_x and scratch_y). This means, I can put my time-sensitive code in scratch_x (cpu1) but if I overflow what's remaining (2KB) even by 1 byte it won't link. My function actually overflows by 360 bytes. Clearing the stack in setup1() and testing how much actually ever gets used, show it consuming 290 bytes (in reality, the locals in loop1() only consume about 28 bytes, and I never return, it's a very tight forever-loop).

Now, I can't "fix" this - there's a libpico link file with all this stuff in it, but if the dummy array was made smaller, and no-one put anything in the scatch areas, it would still have 4KB available like it does now... But it would allow me to fine tune my project.

I'm actually here asking this because the contention on other banks of RAM have made my project fail (every non-flash function from various libraries was being put into the first 64KB bank, making my routine unreliable at various points CPU0 was doing some boring stuff I didn't care how long it took).

Is this something you would consider changing (please)?

@earlephilhower
Copy link
Owner

I'm sorry, but I don't quite understand what you're requesting. If you can throw together a PR it would be easier to look at and discuss. I don't want to change anything in the Pico SDK, but if there's something that's fully compatible w/the way it's done today I'd be fine tweaking things...

@XZCE
Copy link

XZCE commented Jun 20, 2023

The PR would be changing your personal build to set PICO_CORE1_STACK_SIZE=0x100 via a compiler flag.

See multicore.h: (if the define below is not defined, use the CPU0's default size, or else this..)
#define PICO_CORE1_STACK_SIZE 0x800

I.E. Change it from 2KB minimum, to 512B (the most sensible smallest value)

The linker script memmap_default.ld will still calculate its position correctly (see the bottom of the script) using:

__StackOneBottom = __StackOneTop - SIZEOF(.stack1_dummy) <--- dummy was loaded with the core1_stack array in multicore.c

which should now be 0x2004FF00 (instead of 0x2004F800 previously) and the stack (1) top remains 0x20041000 as it was.

This will still allow 4KB of stack, as long as nothing is specified to be put in scratch_x - but it will allow my 2.5KB timing sensitive routine to now link properly in the most appropriate memory region, given I know for sure I don't need more than a few hundred bytes of stack.

A discussion point may be turning on a stack guard, which apparently write protects the top 32 bytes, but that's currently broken anyway since it would go off at the half way point currently. I don't think anyone uses it.

@XZCE
Copy link

XZCE commented Jun 20, 2023

Not sure why I mistyped 0x2004F.. - they should be 0x20040..

@earlephilhower
Copy link
Owner

That seems reasonable, as long as the shrinkage won't allow the heap to grow into it or general code/variables to be assigned to it. Were you going to make a PR, or did you want me to give it a go?

@XZCE
Copy link

XZCE commented Jun 22, 2023

I see my post above was full of typos, I meant add -DPICO_CORE1_STACK_SIZE=0x200 as the compiler flag and my best guess would be in file "make-libpico.sh". I don't have the Pico SDK installed though to try this, I've only got the Arduino stuff.

I would be very grateful if you could give it a go please, and I'll download the new lib file and test it?

Those 2 x 4KB RAM areas are not part of the normal RAM area, so there's no chance the heap will grow into them, or be used for anything else, unless there's a runaway routine in some user's code.

I have put in a temporary work around, making scratch_y 0x300 bytes smaller and scratch_x 0x300 longer - then add a dummy loop1 (which calls my real loop1) where the dummy one has a "volatile char d [0x300];" local (pushing the stack pointer for CPU1 into the right place in its 4KB region). My project has started working reliably now :)

I'm happy to continue to use this work-around if you want to consider this change for a longer time?

@XZCE
Copy link

XZCE commented Jun 25, 2023

I seem to be spending many hours working out why things are not working properly in my project. My latest issue, the SD file-system, has me pondering why SD.open (..., O_RDWR) truncates the opened file (when I didn't say O_TRUNC) where SD.open (..., "r+") doesn't (and works as I would expect). There seems to be so many layers here, Pico Arduino, ESP, Greiman.

I've decided, my project will not be made public, this is nuts.

@XZCE
Copy link

XZCE commented Jun 28, 2023

Looks like there's a bug in your wrapper class, Earle:

const bool append = (mode & O_APPEND) > 0;
if (read & !write) {
return "r";
} else if (!read & write & !append) {
return "w+";
} else if (!read & write & append) {
return "a";
} else if (read & write & !append) {
return "w+"; // may be a bug in FS::mode interpretation, "r+" seems proper
} else if (read & write & append) {
return "a+";
} else {
return "r";
}
}

Should read (at a minimum, but I also question the "w+" at line 185 however making it "w" doesn't "fix" anything there):

    } else if (read &  write & !append) { 
        if (mode & O_TRUNC)
            return "w+";
        else
            return "r+";

My project specifically works on "other people's existing files" so I can't really publish my project that uses this (if someone were to make changes to my "r+" workaround, it may break it and my software will be deleting their data).

@earlephilhower
Copy link
Owner

It's kind of complicated, but the existing code is to match the original Arduino supplied SD.h operation.

It's a direct port of esp8266/Arduino@6dfebec which fixes esp8266/Arduino#8831 .

I'd recommend just using SDFS instead and avoiding the SD cruft if at all possible. It works just like LittleFS.

@XZCE
Copy link

XZCE commented Jun 29, 2023

Ah, OK. I try SDFS instead, thank you.

@XZCE
Copy link

XZCE commented Jul 2, 2023

Someone has left this behind in the WiFi code which I'm now trying to include in my project:

if (pbuf->len != pbuf->tot_len || pbuf->next) {
Serial.println("ERRTOT\r\n");
}

Which doesn't link for me, since I've got "No USB" selected, so I have no "Serial" object (I'm driving the USB in host mode).

@XZCE
Copy link

XZCE commented Jul 9, 2023

Thank you Earle for the quick fix.

Have you thought any more about the linker script change? My project is getting close to working now, it would be good to close that out, if you're happy with it?

Oh, the latest "thing" that was throwing off my precise timing loop - which you may be interested in? My loop1 is in scratch_x and it contains a switch statement (trivially small, but large enough for GCC to build a jump table)... The jump table remains in flash and isn't moved into scratch_x like the code is. Not sure if this image will work, but this is Ghidra showing a reference from scratch_x into flash:

badgcc

I changed over to use GCC's "computed gotos" (&&label) and I can mark the address array to be located in scratch_x. Working.

@earlephilhower
Copy link
Owner

Yeah, the indirect tables were a problem on the ESP8266, too. It's not something we can change here, but in your own linker file you can match *(.rodata._ZTV*) and exclude them from the flash side.

For the stack change, could you throw together a PR? I really don't have any way of testing it works so I'd have to do it blind.

@XZCE
Copy link

XZCE commented Jul 11, 2023

Ah, I think the issue is a moot point, in my final (I probably don't mean final) last edits to get this completed, I've passed the 4KB size of scratch_x with my function and variables. So I've instead set scratch_x to be the last 64KB of normal RAM in the linker script (and edited boards.txt to make normal RAM 192KB). scratch_y is now 8KB. Such is life.

But I would like to thank you again for all the help.

@XZCE
Copy link

XZCE commented Jul 11, 2023

*That was crazy talk, I've now moved Y into the last 4KB of real RAM and made X the whole 8KB special RAM (2 banks)... Got 60KB back... Not that I really use it, but this makes far more sense, CPU0 has no particularly special timing requirements in my project.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants