Skip to content

Native support for AVR instrinsics #711

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
Patryk27 opened this issue Oct 14, 2024 · 23 comments
Open

Native support for AVR instrinsics #711

Patryk27 opened this issue Oct 14, 2024 · 23 comments

Comments

@Patryk27
Copy link
Contributor

At the moment compiler-builtins isn't useful for AVR, because that platform uses a custom calling convention for intrinsics - supporting AVR here would require:

  1. Implementing this custom calling convention in LLVM.
  2. Exposing it in rustc.
  3. Finally, using it within compiler-builtins.

There are no ongoing plans to do so yet, but over rust-lang/rust#131651 it was pointed out it would be nice to have some kind of message of this issue left, so here we go 🙂

@tgross35
Copy link
Contributor

tgross35 commented Oct 14, 2024

Thanks for filing this. I don't think we are blocked on LLVM having support for the calling convention, instead we should be able to use naked functions to translate the cc like we do for aeabi intrinsics

// NOTE This function and the ones below are implemented using assembly because they are using a
// custom calling convention which can't be implemented using a normal Rust function.
#[naked]
#[cfg(not(target_env = "msvc"))]
pub unsafe extern "C" fn __aeabi_uidivmod() {
core::arch::naked_asm!(
"push {{lr}}",
"sub sp, sp, #4",
"mov r2, sp",
bl!("__udivmodsi4"),
"ldr r1, [sp]",
"add sp, sp, #4",
"pop {{pc}}",
);
}
.

I'm not positive how this would interact with the clobbers, however.

@Patryk27
Copy link
Contributor Author

Oh, sure, that seems like a good idea 🙂

@tgross35
Copy link
Contributor

tgross35 commented Oct 14, 2024

Separately you may be interested in getting AVR asm to stable. This isn't a blocker for this crate but it may be useful in user code rust-lang/rust#93335

@tgross35
Copy link
Contributor

Also, do you know if there is a version of these symbols in assembly that don't come from libgcc? Some of these are probably easy enough that we could provide our own assembly implementations, but we need to be careful with licensing.

@Patryk27
Copy link
Contributor Author

Also, do you know if there is a version of these symbols in assembly that don't come from libgcc?

I'm aware only of the libgcc implementation, unfortunately.

@tgross35
Copy link
Contributor

tgross35 commented Feb 5, 2025

After rust-lang/rust#131651 merges it should be possible to fix the AVRTiny asm convention, at which point I think we will be okay to start using AVR assembly in this repo. The #[naked] wrapper to convert the ABI and forward to our implementations is probably still the best option.

@tgross35
Copy link
Contributor

tgross35 commented Mar 4, 2025

@Patryk27 do you happen to know which intrinsics we are actually missing? Testing with the below:

#![no_std]

#[unsafe(no_mangle)]
pub fn call_umulhisi3(a: &u16, b: &u16, r: &mut u32) {
    *r = *a as u32 * *b as u32;
}

#[unsafe(no_mangle)]
pub fn call_mulhisi3(a: &i16, b: &i16, r: &mut i32) {
    *r = *a as i32 * *b as i32;
}

#[unsafe(no_mangle)]
pub fn call_udivmodqi4(a: &u8, b: &u8, r1: &mut u8, r2: &mut u8) {
    (*r1, *r2) = (*a / *b, *a % *b);
}

#[unsafe(no_mangle)]
pub fn call_divmodqi4(a: &i8, b: &i8, r1: &mut i8, r2: &mut i8) {
    (*r1, *r2) = (*a / *b, *a % *b);
}

#[unsafe(no_mangle)]
pub fn call_udivmodhi4(a: &u16, b: &u16, r1: &mut u16, r2: &mut u16) {
    (*r1, *r2) = (*a / *b, *a % *b);
}

#[unsafe(no_mangle)]
pub fn call_divmodhi4(a: &i16, b: &i16, r1: &mut i16, r2: &mut i16) {
    (*r1, *r2) = (*a / *b, *a % *b);
}

On the atmega328p target-cpu (unsure whether this makes a difference), the only intrinsics from the list at https://gcc.gnu.org/wiki/avr-gcc#Exceptions_to_the_Calling_Convention are the 8- and 16-bit integer division __u?divmod[hq]i4. For the multiplication, LLVM seems to widen the integers to u32 and call the default builtin __mulsi3 which I don't think would have the special ABI. Do you typically see more missing symbols (beyond small int division) when you try linking projects without the AVR libgcc?

@Patryk27
Copy link
Contributor Author

Status: was a little bit busy previous week, I'll take a look at those intrinsics and your example in a couple of days 🙂

@Patryk27
Copy link
Contributor Author

You're right - it seems that the only Funny Intrinsics™ used at the moment are __udivmodqi4, __divmodqi4, __udivmodhi4, and __divmodhi4.

I have tried to summon __umulhisi3 with:

#[inline(never)]
pub fn fun(a: &u16, b: &u16, x: &mut u16, y: &mut u16) {
    (*x, *y) = a.widening_mul(*b);
}

... but even this causes LLVM to go with 16b->32b extension followed by 32b x 32b multiplication (aka __mulsi3).

All of those observations match the intrinsics actually hardcoded into the codegen:

https://github.com/llvm/llvm-project/blob/c979ce7e362908a361cee721d2a6db4bcd65be1f/llvm/lib/Target/AVR/AVRISelLowering.cpp#L201

... which confirms that we'd only have to provide __u?divmod[hq]i4.

Curiously enough, some time ago I thought that ABI conflicts were the reason why f32 is broken on AVR:

#527

... but since they don't have special ABI, it must be a codegen bug, actually (this particular investigation will follow on Rahix/avr-hal#641).

@Patryk27
Copy link
Contributor Author

Patryk27 commented Mar 15, 2025

Alright, there is another kind of ABI mismatch happening, though!

Context:

tl;dr intrinsics like those:

pub extern "C" fn __unordsf2(a: f32, b: f32) -> i32 {

... should (most likely?) return c_int instead of i32, as one some platforms (AVR notably) those are actually (expected to be) 16-bit numbers.

Edit: actually, it's not even c_int - that's how LLVM defines it:

Should be easy to implement with a regular cfg-guarded type alias.

@Patryk27
Copy link
Contributor Author

Patryk27 commented Mar 15, 2025

It seems there's an ABI difference in regards to 32-bit division and remainder as well.

Instead of compiler-builtins' fn divrem(T, T, &mut Option<T>) -> T or even GCC's fn divrem(T, T, bool) -> T, AVR expects fn divrem(T, T) -> (T, T) (even for i32/u32) - it's not written down in GCC's docs, but that's what the codegen says:

https://github.com/llvm/llvm-project/blob/1e6ba3cd2fe96be00b6ed6ba28b3d9f9271d784d/llvm/lib/Target/AVR/AVRISelLowering.cpp#L572

... and that's what also came out when I was running avr-tester's random-math test-suite on compiler-builtins without any #[avr_skip] (i.e. it printed trash for division and remainder, which prompted me to investigate those ops).

Now that I have a deeper understanding of how the pieces come together, I see a light out of the tunnel - I'm preparing a pull request that gets rid of all/most #[avr_skip]s, we'll see how it goes.

@Patryk27
Copy link
Contributor Author

Patryk27 commented Mar 15, 2025

Ok, it seems that implementing __udivmodsi4 this way:

#[cfg(target_arch = "avr")]
intrinsics! {
    #[maybe_use_optimized_c_shim]
    pub extern "C" fn __udivmodsi4(n: u32, d: u32) -> u64 {
        let (div, rem) = u32_div_rem(n, d);

        ((rem as u64) << 32) | (div as u64)
    }
}

... yields a working intrinsic!

Unfortunately we can't go with the simpler -> (u32, u32), because that forces a stack allocation instead of passing stuff through registers, but the bit-shifting approach is good enough - LLVM seems to correctly understand it as mostly a regalloc-hinting thingie.

@tones111
Copy link
Contributor

tones111 commented Apr 10, 2025

I'm experimenting with __udivmodqi4 and get the same code gen using naked_asm with the gcc implementation.

    #[naked]
    pub unsafe extern "C" fn __udivmodqi4() {
        // r25: remainder
        // r24: dividend, quotient
        // r22: divisor
        // r23: loop counter
        core::arch::naked_asm!(
            "clr r25",      // clear remainder
            "ldi r23, 8",   // init loop counter
            "lsl r24",      // shift dividend
            // brne target
            "rol r25",      // shift dividend into remainder
            "cp r25, r22",  // compare remainder & divisor
            "brcs 2",       // remainder <= divisor
            "sub r25, r22", // restore remainder
            // brcs target
            "rol r24",      // shift dividend (with CARRY)
            "dec r23",      // decrement loop counter
            "brne -14",
            "com r24",      // complement result (C flag was complemented in loop)
            "ret",
        );
    }

naked_asm errors if you try to use non-integer loop labels, so I had to hard-code the offsets.

I was thinking it would be fun to try implementing the algorithm in Rust, but I don't know how to express the mapping between the input/output parameters and the registers required by the calling convention. Does that limit us to an asm-only solution?

I assume we'd need to implement the calling convention somewhere to allow us to pub extern "AvrDiv" fn __udivmodqi4(n: u8, d: u8) -> (u8, u8) { ... }

@tgross35
Copy link
Contributor

tgross35 commented Apr 10, 2025

naked_asm errors if you try to use non-integer loop labels, so I had to hard-code the offsets.

Integer labels should work here through right? So brne target gets 1: and brcs target gets 2:, the jumps become brcs 2f and brne 1b.

I was thinking it would be fun to try implementing the algorithm in Rust, but I don't know how to express the mapping between the input/output parameters and the registers required by the calling convention. Does that limit us to an asm-only solution?

I assume we'd need to implement the calling convention somewhere to allow us to pub extern "AvrDiv" fn __udivmodqi4(n: u8, d: u8) -> (u8, u8) { ... }

It is possible to add compiler support for this ABI and implement it that way. That probably doesn't have to happen though - it's only needed a couple of places in this crate and likely nowhere else in the ecosystem, so lang/compiler support would be pretty heavy handed.

One option is to "trampoline" ABIs by using naked_asm! to interface between the custom ABI and whatever extern "C" is then call A rust-written function, like we do around here

#[naked]
#[cfg(not(target_env = "msvc"))]
pub unsafe extern "C" fn __aeabi_uidivmod() {
core::arch::naked_asm!(
"push {{lr}}",
"sub sp, sp, #4",
"mov r2, sp",
bl!("__udivmodsi4"),
"ldr r1, [sp]",
"add sp, sp, #4",
"pop {{pc}}",
);
}
#[naked]
pub unsafe extern "C" fn __aeabi_uldivmod() {
core::arch::naked_asm!(
"push {{r4, lr}}",
"sub sp, sp, #16",
"add r4, sp, #8",
"str r4, [sp]",
bl!("__udivmoddi4"),
"ldr r2, [sp, #8]",
"ldr r3, [sp, #12]",
"add sp, sp, #16",
"pop {{r4, pc}}",
);
}
#[naked]
pub unsafe extern "C" fn __aeabi_idivmod() {
core::arch::naked_asm!(
"push {{r0, r1, r4, lr}}",
bl!("__aeabi_idiv"),
"pop {{r1, r2}}",
"muls r2, r2, r0",
"subs r1, r1, r2",
"pop {{r4, pc}}",
);
}
#[naked]
pub unsafe extern "C" fn __aeabi_ldivmod() {
core::arch::naked_asm!(
"push {{r4, lr}}",
"sub sp, sp, #16",
"add r4, sp, #8",
"str r4, [sp]",
bl!("__divmoddi4"),
"ldr r2, [sp, #8]",
"ldr r3, [sp, #12]",
"add sp, sp, #16",
"pop {{r4, pc}}",
);
}
. The routine is so short though that I assume the overhead wouldn't be worth it.

So, feel free to submit a PR adding what you have there :) we don't need to add everything at once. edit: you beat me to it

@AaronKutch
Copy link
Contributor

I want to mention that I've known that the assembly for my division functions would blow up into an absolute monstrosity the more levels of functions there are and the smaller the registers are. I oriented around delegating to half size divisions to make a larger division, but LLVM would inline nest things which would tend to quadratic code complexity. The way I would do it for 8 and maybe 16 bit microcontrollers, given enough time, is to use a common set of byte-slice-based functions for addition, multiplication, etc. The same division function can be used for everything from u16 to u128 by transmuting &uX to &[u8] and &mut uX to &mut [u8], then giving these to the common functions. This would give the smallest binary and stack size which is really the priority on these small devices. This is actually what my awint crate does, in fact it has special casing to use u8 digits on AVR to make it more than viable. However, in the case of division specifically, good 'ole binary division is the best approach considering that multiplication has do be done bit by bit anyway.

// The following example is the most straightforward translation from the way binary
for example. Probably, the one good optimization to make is that there should be combined lhs += rhs << shift and lhs -= rhs << shift functions for implementing multiplication and division. You can brute-force test smaller cases to make sure they are correct, and rely on the fuzzing framework to check the rest.

@tgross35
Copy link
Contributor

For my understanding @Patryk27 @tones111, do you have any idea if more complex intrinsics tend to work on AVR since #791 should be in recent nightlies? Specifically ops involving i128, f128, and f16 which all tend to make ABI edge cases show up.

@tones111
Copy link
Contributor

The unsigned division intrinsics have merged!

The signed equivalents are more complicated as the semantics differs between languages. The book "Hacker's Delight" (2nd edition, chapter 9) describes three different implementations (truncating, modulus, or floor) and lists some languages that have made differing choices.

From what I can find the Rust reference says

  • "integer division rounds towards zero" and "remainder defined with truncating division
  • "Rust uses a remainder defined with truncating division. Given remainder = dividend % divisor, the remainder will have the same sign as the dividend.

Does the intrinsic need to implement Rust's division semantics or something else?

Also confusing is that Rust's Div trait says
"This operation will panic if other == 0 or the division results in overflow", however, it appears to give results for u8::MIN, which I would have expected to overflow. (playground)

@tgross35
Copy link
Contributor

Does the intrinsic need to implement Rust's division semantics or something else?

Usually language-specific semantics are implemented in the standard library, the builtins are closer to emulating behavior on real hardware As an example, the division routines are allowed to create UB if the divisor is 0.

The backends pick the actual semantics here, but LLVM and Cranelift almost always match GCC. It should act the same as __divmodsi4 and the other sizes that we have already.

Also confusing is that Rust's Div trait says
"This operation will panic if other == 0 or the division results in overflow", however, it appears to give results for u8::MIN, which I would have expected to overflow. (playground)

The linked example is i8, -128/-128 = 1 should be correct. Replacing i8 with u8 does panic with 0/0 as expected.

@tones111
Copy link
Contributor

tones111 commented Apr 17, 2025

After staring at __divmodqi4 for far too long I'm struggling to come up with anything sufficiently different from the gcc implementation.

Everything I've seen says to use an unsigned division and modify the sign of the results based on the sign of the inputs. Mathematically the gcc implementation is as straightforward as it gets.

  • compute and store the sign bits
  • convert to unsigned values
  • perform unsigned division
  • apply sign bits

At the cost of a few extra instructions I've packed the remainder sign into R0 instead of the T flag. There's no alternative algorithm for negation (implemented in the ALU). I'm unsure how to proceed.

    pub unsafe extern "C" fn __divmodqi4() {
        // compute signed 8-bit `n / d` and `n % d`.
        //
        // Note: GCC implements a [non-standard calling convention](https://gcc.gnu.org/wiki/avr-gcc#Exceptions_to_the_Calling_Convention) for this function.
        // Inputs:
        //     R24: dividend
        //     R22: divisor
        // Outputs:
        //     R24: quotient  (dividend / divisor)
        //     R25: remainder (dividend % divisor)
        // Clobbers:
        //     R23: loop counter
        //     R0:  sign bits
        //     T:   unused
        core::arch::naked_asm!(
            // This assembly routine adjusts the inputs to perform an unsigned division.
            // The quotient is negative when the dividend and divisor signs differ.
            // The remainder has the same sign as the dividend.
            "mov R0, R22",
            "eor R0, R24",  // R0.7 is the quotient sign
            "cbr R0, 0",    // R0.0 is the remainder sign

            "tst R24",
            "brpos 1f",     // if dividend is negative
            "neg R24",      //    negate to a positive value
            "sbr R0, 0",    //    set remainder sign
            "1:",

            "sbrc R22, 7",  // if divisor is negative
            "neg R22",      //    negate to a positive value

            "call __udivmodqi4",    // perform unsigned division
                                    // R24 = quotient, R25 = remainder
            "sbrc R0, 7",   // apply quotient sign
            "neg R24",

            "sbrc R0, 0",   // apply remainder sign
            "neg R25",
            "ret"
        );
    }

@Patryk27
Copy link
Contributor Author

I'm not a lawyer, but I don't think we have to be 100% different from GCC, it's mostly about "was the code copy-pasted verbatim" etc.; I'd guess that what you propose - assuming the algorithm is alright, haven't checked this one - is good.

@tgross35?

@tgross35
Copy link
Contributor

IIUC even beyond avoiding copy+paste you can't take inspiration from the source, and that's the part that matters rather than how different it is. Which is tough when you have read the source, I don't know what the safe way out of this is.

@joshtriplett is much more the expert here than me

@tgross35
Copy link
Contributor

It's might also be feasible to put all of this into a new file that is GPL and say that on AVR only you will be using GPL sources (already the case for any users that are linking GCC's runtime libs), which is nice because it would let you take the exact implementation. The downside is of course how the licensing changes propagate everywhere, I'm not sure what that would do to rust-lang/rust.

@AaronKutch
Copy link
Contributor

AaronKutch commented Apr 18, 2025

The approach given is the only way I've ever known to do signed division in the many implementations I've made over the years. I don't think there is any licensing issue as long as you wrote the assembly from scratch. The only special cases are related to the iX::MIN / -1 case and the division-by-zero cases. For iX::MIN / -1, the right answer is iX::MIN which is given by using wrapping operations, which should be automatically given by using wrapping operations in the routine above (unless doing saturating or checked division in which case you have to special case it). For the division-by-zero cases, you probably want to implement the RISC-V behavior from #485 . https://five-embeddev.com/riscv-user-isa-manual/Priv-v1.12/m.html .

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants