Native support for AVR instrinsics #711

Patryk27 · 2024-10-14T13:41:14Z

At the moment compiler-builtins isn't useful for AVR, because that platform uses a custom calling convention for intrinsics - supporting AVR here would require:

Implementing this custom calling convention in LLVM.
Exposing it in rustc.
Finally, using it within compiler-builtins.

There are no ongoing plans to do so yet, but over rust-lang/rust#131651 it was pointed out it would be nice to have some kind of message of this issue left, so here we go 🙂

tgross35 · 2024-10-14T17:29:06Z

Thanks for filing this. I don't think we are blocked on LLVM having support for the calling convention, instead we should be able to use naked functions to translate the cc like we do for aeabi intrinsics

compiler-builtins/src/arm.rs

Lines 21 to 35 in cb06005

    
           // NOTE This function and the ones below are implemented using assembly because they are using a 
        
           // custom calling convention which can't be implemented using a normal Rust function. 
        
           #[naked] 
        
           #[cfg(not(target_env = "msvc"))] 
        
           pub unsafe extern "C" fn __aeabi_uidivmod() { 
        
               core::arch::naked_asm!( 
        
                   "push {{lr}}", 
        
                   "sub sp, sp, #4", 
        
                   "mov r2, sp", 
        
                   bl!("__udivmodsi4"), 
        
                   "ldr r1, [sp]", 
        
                   "add sp, sp, #4", 
        
                   "pop {{pc}}", 
        
               ); 
        
           }

.

I'm not positive how this would interact with the clobbers, however.

Patryk27 · 2024-10-14T20:17:26Z

Oh, sure, that seems like a good idea 🙂

tgross35 · 2024-10-14T20:23:48Z

Separately you may be interested in getting AVR asm to stable. This isn't a blocker for this crate but it may be useful in user code rust-lang/rust#93335

tgross35 · 2024-10-14T20:26:54Z

Also, do you know if there is a version of these symbols in assembly that don't come from libgcc? Some of these are probably easy enough that we could provide our own assembly implementations, but we need to be careful with licensing.

Patryk27 · 2024-10-15T09:11:15Z

Also, do you know if there is a version of these symbols in assembly that don't come from libgcc?

I'm aware only of the libgcc implementation, unfortunately.

tgross35 · 2025-02-05T10:01:00Z

After rust-lang/rust#131651 merges it should be possible to fix the AVRTiny asm convention, at which point I think we will be okay to start using AVR assembly in this repo. The #[naked] wrapper to convert the ABI and forward to our implementations is probably still the best option.

tgross35 · 2025-03-04T11:21:05Z

@Patryk27 do you happen to know which intrinsics we are actually missing? Testing with the below:

#![no_std]

#[unsafe(no_mangle)]
pub fn call_umulhisi3(a: &u16, b: &u16, r: &mut u32) {
    *r = *a as u32 * *b as u32;
}

#[unsafe(no_mangle)]
pub fn call_mulhisi3(a: &i16, b: &i16, r: &mut i32) {
    *r = *a as i32 * *b as i32;
}

#[unsafe(no_mangle)]
pub fn call_udivmodqi4(a: &u8, b: &u8, r1: &mut u8, r2: &mut u8) {
    (*r1, *r2) = (*a / *b, *a % *b);
}

#[unsafe(no_mangle)]
pub fn call_divmodqi4(a: &i8, b: &i8, r1: &mut i8, r2: &mut i8) {
    (*r1, *r2) = (*a / *b, *a % *b);
}

#[unsafe(no_mangle)]
pub fn call_udivmodhi4(a: &u16, b: &u16, r1: &mut u16, r2: &mut u16) {
    (*r1, *r2) = (*a / *b, *a % *b);
}

#[unsafe(no_mangle)]
pub fn call_divmodhi4(a: &i16, b: &i16, r1: &mut i16, r2: &mut i16) {
    (*r1, *r2) = (*a / *b, *a % *b);
}

On the atmega328p target-cpu (unsure whether this makes a difference), the only intrinsics from the list at https://gcc.gnu.org/wiki/avr-gcc#Exceptions_to_the_Calling_Convention are the 8- and 16-bit integer division __u?divmod[hq]i4. For the multiplication, LLVM seems to widen the integers to u32 and call the default builtin __mulsi3 which I don't think would have the special ABI. Do you typically see more missing symbols (beyond small int division) when you try linking projects without the AVR libgcc?

Patryk27 · 2025-03-11T19:15:00Z

Status: was a little bit busy previous week, I'll take a look at those intrinsics and your example in a couple of days 🙂

Patryk27 · 2025-03-14T15:53:21Z

You're right - it seems that the only Funny Intrinsics™ used at the moment are __udivmodqi4, __divmodqi4, __udivmodhi4, and __divmodhi4.

I have tried to summon __umulhisi3 with:

#[inline(never)]
pub fn fun(a: &u16, b: &u16, x: &mut u16, y: &mut u16) {
    (*x, *y) = a.widening_mul(*b);
}

... but even this causes LLVM to go with 16b->32b extension followed by 32b x 32b multiplication (aka __mulsi3).

All of those observations match the intrinsics actually hardcoded into the codegen:

https://github.com/llvm/llvm-project/blob/c979ce7e362908a361cee721d2a6db4bcd65be1f/llvm/lib/Target/AVR/AVRISelLowering.cpp#L201

... which confirms that we'd only have to provide __u?divmod[hq]i4.

Curiously enough, some time ago I thought that ABI conflicts were the reason why f32 is broken on AVR:

#527

... but since they don't have special ABI, it must be a codegen bug, actually (this particular investigation will follow on Rahix/avr-hal#641).

Patryk27 · 2025-03-15T11:07:48Z

Alright, there is another kind of ABI mismatch happening, though!

Context:

Problem with floats on atmega32u4 - `undefined reference to '__extendsfdf2'` Rahix/avr-hal#641 (comment)
Problem with floats on atmega32u4 - `undefined reference to '__extendsfdf2'` Rahix/avr-hal#641 (comment)

tl;dr intrinsics like those:

compiler-builtins/src/float/cmp.rs

Line 114 in 7bec089

pub extern "C" fn __unordsf2(a: f32, b: f32) -> i32 {

... should (most likely?) return c_int instead of i32, as one some platforms (AVR notably) those are actually (expected to be) 16-bit numbers.

Edit: actually, it's not even c_int - that's how LLVM defines it:

Should be easy to implement with a regular cfg-guarded type alias.

Patryk27 · 2025-03-15T12:40:43Z

It seems there's an ABI difference in regards to 32-bit division and remainder as well.

Instead of compiler-builtins' fn divrem(T, T, &mut Option<T>) -> T or even GCC's fn divrem(T, T, bool) -> T, AVR expects fn divrem(T, T) -> (T, T) (even for i32/u32) - it's not written down in GCC's docs, but that's what the codegen says:

https://github.com/llvm/llvm-project/blob/1e6ba3cd2fe96be00b6ed6ba28b3d9f9271d784d/llvm/lib/Target/AVR/AVRISelLowering.cpp#L572

... and that's what also came out when I was running avr-tester's random-math test-suite on compiler-builtins without any #[avr_skip] (i.e. it printed trash for division and remainder, which prompted me to investigate those ops).

Now that I have a deeper understanding of how the pieces come together, I see a light out of the tunnel - I'm preparing a pull request that gets rid of all/most #[avr_skip]s, we'll see how it goes.

Patryk27 · 2025-03-15T18:33:45Z

Ok, it seems that implementing __udivmodsi4 this way:

#[cfg(target_arch = "avr")]
intrinsics! {
    #[maybe_use_optimized_c_shim]
    pub extern "C" fn __udivmodsi4(n: u32, d: u32) -> u64 {
        let (div, rem) = u32_div_rem(n, d);

        ((rem as u64) << 32) | (div as u64)
    }
}

... yields a working intrinsic!

Unfortunately we can't go with the simpler -> (u32, u32), because that forces a stack allocation instead of passing stuff through registers, but the bit-shifting approach is good enough - LLVM seems to correctly understand it as mostly a regalloc-hinting thingie.

tones111 · 2025-04-10T14:53:47Z

I'm experimenting with __udivmodqi4 and get the same code gen using naked_asm with the gcc implementation.

    #[naked]
    pub unsafe extern "C" fn __udivmodqi4() {
        // r25: remainder
        // r24: dividend, quotient
        // r22: divisor
        // r23: loop counter
        core::arch::naked_asm!(
            "clr r25",      // clear remainder
            "ldi r23, 8",   // init loop counter
            "lsl r24",      // shift dividend
            // brne target
            "rol r25",      // shift dividend into remainder
            "cp r25, r22",  // compare remainder & divisor
            "brcs 2",       // remainder <= divisor
            "sub r25, r22", // restore remainder
            // brcs target
            "rol r24",      // shift dividend (with CARRY)
            "dec r23",      // decrement loop counter
            "brne -14",
            "com r24",      // complement result (C flag was complemented in loop)
            "ret",
        );
    }

naked_asm errors if you try to use non-integer loop labels, so I had to hard-code the offsets.

I was thinking it would be fun to try implementing the algorithm in Rust, but I don't know how to express the mapping between the input/output parameters and the registers required by the calling convention. Does that limit us to an asm-only solution?

I assume we'd need to implement the calling convention somewhere to allow us to pub extern "AvrDiv" fn __udivmodqi4(n: u8, d: u8) -> (u8, u8) { ... }

tgross35 · 2025-04-10T18:09:17Z

naked_asm errors if you try to use non-integer loop labels, so I had to hard-code the offsets.

Integer labels should work here through right? So brne target gets 1: and brcs target gets 2:, the jumps become brcs 2f and brne 1b.

I was thinking it would be fun to try implementing the algorithm in Rust, but I don't know how to express the mapping between the input/output parameters and the registers required by the calling convention. Does that limit us to an asm-only solution?

I assume we'd need to implement the calling convention somewhere to allow us to pub extern "AvrDiv" fn __udivmodqi4(n: u8, d: u8) -> (u8, u8) { ... }

It is possible to add compiler support for this ABI and implement it that way. That probably doesn't have to happen though - it's only needed a couple of places in this crate and likely nowhere else in the ecosystem, so lang/compiler support would be pretty heavy handed.

One option is to "trampoline" ABIs by using naked_asm! to interface between the custom ABI and whatever extern "C" is then call A rust-written function, like we do around here

compiler-builtins/compiler-builtins/src/arm.rs

Lines 23 to 77 in 9978a8b

    
           #[naked] 
        
           #[cfg(not(target_env = "msvc"))] 
        
           pub unsafe extern "C" fn __aeabi_uidivmod() { 
        
               core::arch::naked_asm!( 
        
                   "push {{lr}}", 
        
                   "sub sp, sp, #4", 
        
                   "mov r2, sp", 
        
                   bl!("__udivmodsi4"), 
        
                   "ldr r1, [sp]", 
        
                   "add sp, sp, #4", 
        
                   "pop {{pc}}", 
        
               ); 
        
           } 
        
           #[naked] 
        
           pub unsafe extern "C" fn __aeabi_uldivmod() { 
        
               core::arch::naked_asm!( 
        
                   "push {{r4, lr}}", 
        
                   "sub sp, sp, #16", 
        
                   "add r4, sp, #8", 
        
                   "str r4, [sp]", 
        
                   bl!("__udivmoddi4"), 
        
                   "ldr r2, [sp, #8]", 
        
                   "ldr r3, [sp, #12]", 
        
                   "add sp, sp, #16", 
        
                   "pop {{r4, pc}}", 
        
               ); 
        
           } 
        
           #[naked] 
        
           pub unsafe extern "C" fn __aeabi_idivmod() { 
        
               core::arch::naked_asm!( 
        
                   "push {{r0, r1, r4, lr}}", 
        
                   bl!("__aeabi_idiv"), 
        
                   "pop {{r1, r2}}", 
        
                   "muls r2, r2, r0", 
        
                   "subs r1, r1, r2", 
        
                   "pop {{r4, pc}}", 
        
               ); 
        
           } 
        
           #[naked] 
        
           pub unsafe extern "C" fn __aeabi_ldivmod() { 
        
               core::arch::naked_asm!( 
        
                   "push {{r4, lr}}", 
        
                   "sub sp, sp, #16", 
        
                   "add r4, sp, #8", 
        
                   "str r4, [sp]", 
        
                   bl!("__divmoddi4"), 
        
                   "ldr r2, [sp, #8]", 
        
                   "ldr r3, [sp, #12]", 
        
                   "add sp, sp, #16", 
        
                   "pop {{r4, pc}}", 
        
               ); 
        
           }

. The routine is so short though that I assume the overhead wouldn't be worth it.

So, feel free to submit a PR adding what you have there :) we don't need to add everything at once. edit: you beat me to it

AaronKutch · 2025-04-16T02:16:35Z

I want to mention that I've known that the assembly for my division functions would blow up into an absolute monstrosity the more levels of functions there are and the smaller the registers are. I oriented around delegating to half size divisions to make a larger division, but LLVM would inline nest things which would tend to quadratic code complexity. The way I would do it for 8 and maybe 16 bit microcontrollers, given enough time, is to use a common set of byte-slice-based functions for addition, multiplication, etc. The same division function can be used for everything from u16 to u128 by transmuting &uX to &[u8] and &mut uX to &mut [u8], then giving these to the common functions. This would give the smallest binary and stack size which is really the priority on these small devices. This is actually what my awint crate does, in fact it has special casing to use u8 digits on AVR to make it more than viable. However, in the case of division specifically, good 'ole binary division is the best approach considering that multiplication has do be done bit by bit anyway.

compiler-builtins/compiler-builtins/src/int/specialized_div_rem/binary_long.rs

Line 54 in 9978a8b

    
           // The following example is the most straightforward translation from the way binary

for example. Probably, the one good optimization to make is that there should be combined lhs += rhs << shift and lhs -= rhs << shift functions for implementing multiplication and division. You can brute-force test smaller cases to make sure they are correct, and rely on the fuzzing framework to check the rest.

tgross35 · 2025-04-16T23:20:24Z

For my understanding @Patryk27 @tones111, do you have any idea if more complex intrinsics tend to work on AVR since #791 should be in recent nightlies? Specifically ops involving i128, f128, and f16 which all tend to make ABI edge cases show up.

tones111 · 2025-04-16T23:29:04Z

The unsigned division intrinsics have merged!

The signed equivalents are more complicated as the semantics differs between languages. The book "Hacker's Delight" (2nd edition, chapter 9) describes three different implementations (truncating, modulus, or floor) and lists some languages that have made differing choices.

From what I can find the Rust reference says

"integer division rounds towards zero" and "remainder defined with truncating division
"Rust uses a remainder defined with truncating division. Given remainder = dividend % divisor, the remainder will have the same sign as the dividend.

Does the intrinsic need to implement Rust's division semantics or something else?

Also confusing is that Rust's Div trait says
"This operation will panic if other == 0 or the division results in overflow", however, it appears to give results for u8::MIN, which I would have expected to overflow. (playground)

tgross35 · 2025-04-17T05:10:07Z

Does the intrinsic need to implement Rust's division semantics or something else?

Usually language-specific semantics are implemented in the standard library, the builtins are closer to emulating behavior on real hardware As an example, the division routines are allowed to create UB if the divisor is 0.

The backends pick the actual semantics here, but LLVM and Cranelift almost always match GCC. It should act the same as __divmodsi4 and the other sizes that we have already.

Also confusing is that Rust's Div trait says
"This operation will panic if other == 0 or the division results in overflow", however, it appears to give results for u8::MIN, which I would have expected to overflow. (playground)

The linked example is i8, -128/-128 = 1 should be correct. Replacing i8 with u8 does panic with 0/0 as expected.

tones111 · 2025-04-17T20:36:22Z

After staring at __divmodqi4 for far too long I'm struggling to come up with anything sufficiently different from the gcc implementation.

Everything I've seen says to use an unsigned division and modify the sign of the results based on the sign of the inputs. Mathematically the gcc implementation is as straightforward as it gets.

compute and store the sign bits
convert to unsigned values
perform unsigned division
apply sign bits

At the cost of a few extra instructions I've packed the remainder sign into R0 instead of the T flag. There's no alternative algorithm for negation (implemented in the ALU). I'm unsure how to proceed.

    pub unsafe extern "C" fn __divmodqi4() {
        // compute signed 8-bit `n / d` and `n % d`.
        //
        // Note: GCC implements a [non-standard calling convention](https://gcc.gnu.org/wiki/avr-gcc#Exceptions_to_the_Calling_Convention) for this function.
        // Inputs:
        //     R24: dividend
        //     R22: divisor
        // Outputs:
        //     R24: quotient  (dividend / divisor)
        //     R25: remainder (dividend % divisor)
        // Clobbers:
        //     R23: loop counter
        //     R0:  sign bits
        //     T:   unused
        core::arch::naked_asm!(
            // This assembly routine adjusts the inputs to perform an unsigned division.
            // The quotient is negative when the dividend and divisor signs differ.
            // The remainder has the same sign as the dividend.
            "mov R0, R22",
            "eor R0, R24",  // R0.7 is the quotient sign
            "cbr R0, 0",    // R0.0 is the remainder sign

            "tst R24",
            "brpos 1f",     // if dividend is negative
            "neg R24",      //    negate to a positive value
            "sbr R0, 0",    //    set remainder sign
            "1:",

            "sbrc R22, 7",  // if divisor is negative
            "neg R22",      //    negate to a positive value

            "call __udivmodqi4",    // perform unsigned division
                                    // R24 = quotient, R25 = remainder
            "sbrc R0, 7",   // apply quotient sign
            "neg R24",

            "sbrc R0, 0",   // apply remainder sign
            "neg R25",
            "ret"
        );
    }

Patryk27 · 2025-04-17T21:00:56Z

I'm not a lawyer, but I don't think we have to be 100% different from GCC, it's mostly about "was the code copy-pasted verbatim" etc.; I'd guess that what you propose - assuming the algorithm is alright, haven't checked this one - is good.

@tgross35?

tgross35 · 2025-04-17T21:19:08Z

IIUC even beyond avoiding copy+paste you can't take inspiration from the source, and that's the part that matters rather than how different it is. Which is tough when you have read the source, I don't know what the safe way out of this is.

@joshtriplett is much more the expert here than me

tgross35 · 2025-04-17T21:23:09Z

It's might also be feasible to put all of this into a new file that is GPL and say that on AVR only you will be using GPL sources (already the case for any users that are linking GCC's runtime libs), which is nice because it would let you take the exact implementation. The downside is of course how the licensing changes propagate everywhere, I'm not sure what that would do to rust-lang/rust.

AaronKutch · 2025-04-18T01:59:21Z

The approach given is the only way I've ever known to do signed division in the many implementations I've made over the years. I don't think there is any licensing issue as long as you wrote the assembly from scratch. The only special cases are related to the iX::MIN / -1 case and the division-by-zero cases. For iX::MIN / -1, the right answer is iX::MIN which is given by using wrapping operations, which should be automatically given by using wrapping operations in the routine above (unless doing saturating or checked division in which case you have to special case it). For the division-by-zero cases, you probably want to implement the RISC-V behavior from #485 . https://five-embeddev.com/riscv-user-isa-manual/Priv-v1.12/m.html .

jfrimmel mentioned this issue Oct 16, 2024

Regression test for AVR rjmp offset rust-lang/rust#131755

Merged

tgross35 mentioned this issue Feb 19, 2025

Libcalls for Windows without SSE should return i128 on the stack llvm/llvm-project#127723

Open

Patryk27 mentioned this issue Mar 15, 2025

For msp430, __ashldi3 takes a 16-bit integer as the second argument in LLVM's compiler-rt. Probably affects other builtins. #520

Closed

Patryk27 mentioned this issue Mar 16, 2025

avr: Skip No More! #791

Merged

Patryk27 mentioned this issue Apr 4, 2025

Get rid of avr-libc dependency Rahix/avr-device#76

Open

tones111 mentioned this issue Apr 10, 2025

avr: implement __[u]divmod(h|q)i4 #815

Closed

tones111 mentioned this issue Apr 12, 2025

avr: implement unsigned division intrinsics #816

Merged

tones111 mentioned this issue Apr 20, 2025

roadmap to stable rust Rahix/avr-hal#471

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Native support for AVR instrinsics #711

Native support for AVR instrinsics #711

Patryk27 commented Oct 14, 2024

tgross35 commented Oct 14, 2024 •

edited

Loading

Patryk27 commented Oct 14, 2024

tgross35 commented Oct 14, 2024 •

edited

Loading

tgross35 commented Oct 14, 2024

Patryk27 commented Oct 15, 2024

tgross35 commented Feb 5, 2025

tgross35 commented Mar 4, 2025

Patryk27 commented Mar 11, 2025

Patryk27 commented Mar 14, 2025

Patryk27 commented Mar 15, 2025 •

edited

Loading

Patryk27 commented Mar 15, 2025 •

edited

Loading

Patryk27 commented Mar 15, 2025 •

edited

Loading

tones111 commented Apr 10, 2025 •

edited

Loading

tgross35 commented Apr 10, 2025 •

edited

Loading

AaronKutch commented Apr 16, 2025

tgross35 commented Apr 16, 2025

tones111 commented Apr 16, 2025

tgross35 commented Apr 17, 2025

tones111 commented Apr 17, 2025 •

edited

Loading

Patryk27 commented Apr 17, 2025

tgross35 commented Apr 17, 2025

tgross35 commented Apr 17, 2025

AaronKutch commented Apr 18, 2025 •

edited

Loading

Native support for AVR instrinsics #711

Native support for AVR instrinsics #711

Comments

Patryk27 commented Oct 14, 2024

tgross35 commented Oct 14, 2024 • edited Loading

Patryk27 commented Oct 14, 2024

tgross35 commented Oct 14, 2024 • edited Loading

tgross35 commented Oct 14, 2024

Patryk27 commented Oct 15, 2024

tgross35 commented Feb 5, 2025

tgross35 commented Mar 4, 2025

Patryk27 commented Mar 11, 2025

Patryk27 commented Mar 14, 2025

Patryk27 commented Mar 15, 2025 • edited Loading

Patryk27 commented Mar 15, 2025 • edited Loading

Patryk27 commented Mar 15, 2025 • edited Loading

tones111 commented Apr 10, 2025 • edited Loading

tgross35 commented Apr 10, 2025 • edited Loading

AaronKutch commented Apr 16, 2025

tgross35 commented Apr 16, 2025

tones111 commented Apr 16, 2025

tgross35 commented Apr 17, 2025

tones111 commented Apr 17, 2025 • edited Loading

Patryk27 commented Apr 17, 2025

tgross35 commented Apr 17, 2025

tgross35 commented Apr 17, 2025

AaronKutch commented Apr 18, 2025 • edited Loading

tgross35 commented Oct 14, 2024 •

edited

Loading

tgross35 commented Oct 14, 2024 •

edited

Loading

Patryk27 commented Mar 15, 2025 •

edited

Loading

Patryk27 commented Mar 15, 2025 •

edited

Loading

Patryk27 commented Mar 15, 2025 •

edited

Loading

tones111 commented Apr 10, 2025 •

edited

Loading

tgross35 commented Apr 10, 2025 •

edited

Loading

tones111 commented Apr 17, 2025 •

edited

Loading

AaronKutch commented Apr 18, 2025 •

edited

Loading