
SPI Bus performance loss of ~9% under Python moving from Kernel 6.6.31 -> 6.12.25 #6854


Open

Roland-Smith-Imperial opened this issue May 14, 2025 · 15 comments

@Roland-Smith-Imperial commented May 14, 2025

Describe the bug

I am using Python (3.11.2) on a Pi 5 to drive an AD7606 ADC with a loop that (1) triggers a hardware sample event using a pulse from a GPIO line (libgpiod 1.6.3-1), (2) waits 3 uSec (the hardware acquisition time), (3) reads back the ADC data via the SPI bus, formatted as a pair of int16 values, using the spidev (3.6-1) xfer2 function, and (4) records a time-stamp using time.perf_counter_ns.
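The full acquisition cycle (1)-(4) looks roughly like the sketch below (a minimal illustration only; the GPIO line object and SPI device are passed in pre-opened, and the byte-unpacking helper matches the demonstration code further down):

```python
import struct
import time
from time import perf_counter_ns

def unpack_2x_int16(raw):
    """Convert 4 raw bytes from the ADC into two big-endian int16 values."""
    return struct.unpack('>hh', bytes(raw))

def acquire(n, spi, trig_line):
    """One trigger -> settle -> SPI read -> timestamp cycle, repeated n times.
    `spi` is an opened spidev.SpiDev, `trig_line` a requested gpiod output line."""
    out = []
    for _ in range(n):
        trig_line.set_value(1)        # (1) hardware sample pulse
        trig_line.set_value(0)
        time.sleep(3e-6)              # (2) ~3 uSec hardware acquisition time
        ch1, ch2 = unpack_2x_int16(spi.xfer2([0, 0, 0, 0]))  # (3) SPI read
        out.append((perf_counter_ns(), ch1, ch2))            # (4) timestamp
    return out
```

The before/after benchmark below strips this down to step (3) alone, since that is where the regression shows up.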

Updating the RPi OS kernel from 6.6.31 to 6.12.25, the average data rate drops from 35.5 to 32.5 kHz, a ~9% performance hit. The culprit seems to be the SPI read using spidev's xfer2 function: that step increases from an average of 21.6 to 24.3 uSec, along with a significant increase in timing jitter.

Reverting back to 6.6.31 recovers the performance.

Stripping the python code back to a minimalist loop that only runs the spidev xfer2 call (with no hardware in place) shows the same performance drop. Plotting a histogram of the time taken for each sample loop also highlights much more loop-to-loop timing jitter with the 6.12.25 kernel.

Demonstration Python code below.

Steps to reproduce the behaviour


  1. Run the python code below to benchmark the performance of the SPI bus using spidev xfer2 on an RPi5 with a 6.6.31 OS Kernel. No additional hardware required.

  2. Upgrade the kernel to 6.12.25 (no other package updates).

  3. Re-run the python code to benchmark the performance of spidev xfer2 and compare before / after average loop time and loop-to-loop timing jitter.

########### Test code ############

import spidev                     # SPI bus IO tools
import struct                     # Conversion from 8-bit pairs to int16
import numpy as np                # Maths tools
import matplotlib.pyplot as plt   # Plotting functions, data display

from time import perf_counter_ns  # Access ns resolution timer

# SPI Bus set up and read back tools.
# Will run with no hardware connected.

class ADC_Pi(object):
    def __init__(self):
        self.spiADC = spidev.SpiDev()       # Link to SPI bus and init
        self.spiADC.open(0, 0)              # Expects SPI device at /dev/spidev0.0
        self.spiADC.max_speed_hz = 8000000  # Fails to read back correctly at >= 10,000,000

    def read_adc_raw_2x(self):              # Read 4x raw 8-bit words, return 2x int16 values
        raw_in = bytes(self.spiADC.xfer2([0, 0, 0, 0]))
        return (struct.unpack_from('>h', raw_in, offset=0)[0],
                struct.unpack_from('>h', raw_in, offset=2)[0])


def adc_speed_test():                 # Test speed of SPI read back
    adc = ADC_Pi()
    loops = 100000
    start_time = perf_counter_ns()
    for _ in range(loops):
        adc.read_adc_raw_2x()
    run_time = perf_counter_ns() - start_time
    print("Average SPI 2x Read  = %4.2f uSec" % ((run_time / loops) / 1000))
    

#                Data acquisition and analysis


def capture_data_pairs(samples):  # Capture N readings of 16-bit data and time stamp
    samples = max(3, samples)      # Min 3 samples 
    samples = int(samples)
    
    print(" ")
    print("Capturing", samples, "16-bit ADC samples" )
    adc = ADC_Pi()               # create instance of ADC_Pi    

    adc_data_1 = np.zeros(samples, dtype=np.int32)  # Make data storage arrays
    adc_data_2 = np.zeros(samples, dtype=np.int32)
    time_data  = np.zeros(samples, dtype=np.int64)
    start_time = perf_counter_ns()

    for i in range(samples):         # Populate 1D arrays of N samples + time stamps
        time_data[i] = perf_counter_ns()                      # Save precision timestamp
        adc_data_1[i], adc_data_2[i] = adc.read_adc_raw_2x()  # Dual channel read

    total_time = 1E-9 * (perf_counter_ns() - start_time)  # Total run time.
    time_data = time_data - time_data[0]                      # Set time stamp increment from zero.
    time_interval = time_data[2:samples - 1] - time_data[1:samples - 2]
    print("Returning Ch1, Ch2 16-bit data and time arrays")
    print("Time taken (Sec)    =   %4.2f" % (total_time))
    print("Average Sample Rate =   %4.2f" % (samples / (1000 * total_time)),"kHz")
    print("Mean time interval  =   %4.2f +/- %2.1f" % (1E-3 * np.mean(time_interval), 1E-3 * np.std(time_interval)), "uSec")

    return adc_data_1, adc_data_2, time_data


def time_histogram_2(time_data):   # Histogram of sample time interval
    samples = len(time_data)
    time_interval =  time_data[1:] - time_data[:(samples - 1)]  # Calculate time intervals
    time_interval =  time_interval[1:(samples-2)]               # Trim off last (bad) interval
    time_interval =  time_interval / 1E3                        # Convert to u-sec

    min_time_interval = min(time_interval)
    max_time_interval = max(time_interval)
    run_time = (time_data[samples - 1] - time_data[0]) * 1E-9
    sample_rate = (samples / run_time) / 1000   # Rate in kHz

    mean_time_interval = np.mean(time_interval)
    mean_time_sd = np.std(time_interval)
 
    fig = plt.figure(figsize=(10, 6), dpi = 140, layout="constrained")
    spec = fig.add_gridspec(5, 4)

    params = {                        # Set up plot
        'figure.figsize': [10, 4],    # Fig size, width, height
        'axes.labelsize': 14,
        'font.size': 10,
        } 
    plt.rcParams.update(params)

    fig.add_subplot(spec[0:-1, :2])
    plt.ylabel("Number of samples")      # set the y-label
    plt.xlabel("Time Interval / uSec")   # set the x-label
    counts, bins, junk = plt.hist(time_interval, bins = 100, range=[min_time_interval,  min_time_interval*1.15])   # create a histogram of time intervals
#    counts, bins, junk = plt.hist(time_interval, bins = 100, range=[33, 36])   # create a histogram of time intervals
    bincentres = 0.5*(bins[1:] + bins[:-1])
    
    fig.add_subplot(spec[0:-1, 2:])      # Log plot of histogram data
    plt.ylabel("Log Number of samples")  # set the y-label
    plt.xlabel("Time Interval / uSec")   # set the x-label
    counts, bins, junk = plt.hist(time_interval, bins = 100)   # create a histogram of time intervals
    bincentres = 0.5*(bins[1:] + bins[:-1])
    plt.yscale('log')                    # Plot y scale as Log

    fig.add_subplot(spec[4, :]) # Lower, long (text box)
    
    plt.text(-0.05, 0.8, (str(samples / 1000)) + "k Samples over %3.1f Seconds" % (run_time), fontsize = 12)
    plt.text(-0.05, 0.55, "Average Sample Rate =  %3.2f kHz " % (sample_rate),  fontsize = 12)
    plt.text(-0.05, 0.30, "Mean Time Interval     =   %3.1f  +/-  %3.1f uSec." % (mean_time_interval, mean_time_sd), fontsize = 12)

    plt.text( 0.5 , 0.8, "Min Time Interval      =  %3.1f uSec" % (min_time_interval), fontsize = 12)
    plt.text( 0.5 , 0.55, "Max Time Interval      =  %3.1f uSec" % (max_time_interval), fontsize = 12)

    plt.axis('off')       # Turn off X, Y axis labels and plot box
    fig.suptitle(str(samples / 1000) + "k  Samples - Time Histogram Analysis. ")
    plt.show()

# Run tests

adc_speed_test()

aa, bb, cc = capture_data_pairs(100000)

time_histogram_2(cc)

Device(s)

Raspberry Pi CM5

System

"Problem" state post update.

cat /etc/rpi-issue
Raspberry Pi reference 2024-03-15
Generated using pi-gen, https://github.com/RPi-Distro/pi-gen, f19ee211ddafcae300827f953d143de92a5c6624, stage4

vcgencmd version
2025/05/08 15:13:17
Copyright (c) 2012 Broadcom
version 69471177 (release) (embedded)

uname -a
Linux raspberrypi 6.12.25+rpt-rpi-2712 #1 SMP PREEMPT Debian 1:6.12.25-1+rpt1 (2025-04-30) aarch64 GNU/Linux

Initial "problem free" state before update.

cat /etc/rpi-issue
Raspberry Pi reference 2024-03-15
Generated using pi-gen, https://github.com/RPi-Distro/pi-gen, f19ee211ddafcae300827f953d143de92a5c6624, stage4

vcgencmd version
2025/05/08 15:13:17
Copyright (c) 2012 Broadcom
version 69471177 (release) (embedded)

uname -a
Linux raspberrypi 6.6.31+rpt-rpi-2712 #1 SMP PREEMPT Debian 1:6.6.31-1+rpt1 (2024-05-29) aarch64 GNU/Linux

Logs

NA

Additional context

We are aiming for "fast", low-jitter performance of our ADC system, so the drop in performance with a new kernel was a bit of a surprise. All versions of the 6.12.XX kernel tested seem to have the same issue. Binding the Python code to a single reserved core (import psutil; p = psutil.Process(); p.cpu_affinity([3])) helps slightly with overall speed, and reduces loop-to-loop jitter rather nicely. However, this and the other minor tweaks (the GPIO hardware trigger, hardware settling-time delay, etc.) were unaffected by the kernel update and were all stripped out for the before/after testing code.
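The same core-binding can be done with the standard library alone (a minimal sketch, assuming a Linux host; the CPU number and priority value are illustrative, not a recommendation):

```python
import os

def pin_to_cpu(cpu):
    """Bind the current process to a single CPU to cut scheduler-induced
    loop-to-loop jitter (Linux only)."""
    os.sched_setaffinity(0, {cpu})    # 0 = the calling process
    return os.sched_getaffinity(0)

def try_set_fifo_priority(prio=99):
    """Best-effort switch to SCHED_FIFO, the same effect as launching
    with `sudo chrt 99 python3 ...`; silently skipped without root."""
    try:
        os.sched_setscheduler(0, os.SCHED_FIFO, os.sched_param(prio))
        return True
    except PermissionError:
        return False

print(pin_to_cpu(0))
```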

@Roland-Smith-Imperial (Author)

Oops, apologies for the messy formatting of that python code !

@timg236 (Contributor) commented May 14, 2025

It might be worth trying the experimental PREEMPT_RT variant of the RPi downstream 6.12 kernel to see if that reduces jitter.

To install the kernel run this

sudo WANT_64BIT_RT=1 rpi-update rpi-6.12.y
and add the following to config.txt

kernel=kernel8_rt.img
auto_initramfs=0

@Roland-Smith-Imperial (Author)

Hi Tim

Many thanks for the suggestion. I gave the PREEMPT_RT kernel variant a go as suggested above; it is still a slight "downgrade" in terms of average sample rate, and has a marginally better inter-sample distribution width-wise, but that also looks a little "lumpy". Histograms to follow.

Running our ADC sampling code, bound to a reserved CPU 3 and launched as "sudo chrt 99 python3 our_code.py" to give it a high RT priority, gives:

6.6.31 -> 36.7 kHz average sample rate over 1E6 samples
Nice tight Gaussian distribution to the inter-sample interval, average inter-sample time 27.2 +/- 0.2 uSec, max 84 uSec

6.12.28 -> 34.76 kHz average sample rate over 1E6 samples
Slightly broader Gaussian distribution to the inter-sample interval, average inter-sample time 28.8 +/- 0.2 uSec, max 47.3 uSec

So still a small 5% speed hit for 6.6.31 versus 6.12.28 RT (but definitely better than the non-RT 6.12.28 tests), and a slight win on clamping the maximum inter-sample interval, with that falling from 84 uSec to 47 uSec.

Any other suggestions gratefully received; otherwise sticking with 6.6.31 is probably our best bet for now.

Cheers

@Roland-Smith-Imperial (Author)

(Timing histogram images attached.)

@Roland-Smith-Imperial (Author)

Histogram for a "standard" 6.12.28 kernel in non-RT mode below for comparison.

@Roland-Smith-Imperial (Author)

(Histogram image attached.)

@timg236 (Contributor) commented May 15, 2025

Since SPI is on RP1, the system is exposed to latency/jitter from PCIe. I'm not aware of any major differences in PCIe setup between 6.6 and 6.12, but others might comment if they know otherwise.
For GPIO, the RP1 PIO interface can be used to avoid this. I guess it's theoretically possible to use the same approach here to continuously read from the ADCs, but I don't think there is a nice reusable demo for that.

@P33M (Contributor) commented May 15, 2025

The non-RT latency tail is far too big for PCIe DMA/link effects to be significant contributors.

With short transfer sizes such as these, the hardware is used in interrupt-driven FIFO mode, so one potential source of regression is a kernel task that disables interrupts for longer than it did on 6.6.

In RT kernels, almost all mechanisms that disable hardware interrupts are abstracted into a complicated pile of thread-like primitives, so a greedy neighbour cannot prevent higher priority tasks from running.

@pelwell (Contributor) commented May 15, 2025

We think this may be caused by the addition of resource locking in the use of GPIO descriptors to prevent them being freed while in use, a change that appeared in the 6.9 kernel. This would affect SPI performance because of the use of software-controlled GPIO chip-select lines. This is a feature we enabled for older Pis because it is more flexible than the rather restricted CS support implemented by the BCM2835 SPI hardware; it also allows the use of arbitrary GPIOs, rather than being restricted to those where the SPI CS alternate functions are mapped.

The SPI controllers in RP1 do support HW CS so it might be possible to use it instead. It would require Device Tree changes to indicate to the SPI subsystem that we wish to do so, and to enable the correct alternate functions on the CS pins. However, it's not a feature that has been used much, and some of the comments in the driver make me think it might be a bit quirky.
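For reference, the Device Tree change to opt into native chip-select might look roughly like the fragment below. This is a hypothetical, untested sketch: an empty cs-gpios property is the generic SPI-binding way to fall back to controller CS, but the accompanying pin-function change is only gestured at, not spelled out.

```dts
/* Hypothetical overlay sketch - untested, for illustration only */
&spi0 {
    status = "okay";
    cs-gpios = <>;   /* empty list: let the controller drive CS natively */
    /* The CS pad would also need its SPI alternate function selected
       via the relevant pinctrl node, which is omitted here. */
};
```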

@popcornmix (Collaborator)

This is the upstream change that may increase overhead when using gpio: https://lwn.net/Articles/960024

@Roland-Smith-Imperial (Author)

Many thanks to popcornmix, pelwell, P33M and timg236 for all your insights, much appreciated. It sounds like the (minor) slowdown we see is the result of some pretty deep under-the-hood tweaks to stop GPIO users colliding and breaking things; good to know there was a specific cause rather than "bad code". For our application, sticking with 6.6.31 for now is, I think, the way to go, as we want to maximise speed and only have one application alternating between SPI and GPIO. Best wishes to all.

@P33M (Contributor) commented May 16, 2025

The SPI controllers in RP1 do support HW CS so it might be possible to use it instead.

The hardware chipselects are badly implemented. If the TX FIFO empties mid-transfer, the hardware deasserts CS. A couple of other silicon integrators have workarounds in hardware which basically reimplement GPIO-CS functionality inside the SPI hardware block.

https://github.com/raspberrypi/linux/blob/rpi-6.12.y/drivers/spi/spi-dw-mmio.c#L151

@pelwell (Contributor) commented May 16, 2025

Indeed - the comment before that function was one of those I referred to.

I can't help thinking that if your resource lifetime management requires interlocking on every single access, however brief, then you're doing it wrong. Surely one of the great things about descriptors is that they are explicitly claimed and released, and if there is a valid descriptor then the underlying gpio chip object (and its driver) should not be deleted.

pelwell added a commit to pelwell/linux that referenced this issue May 19, 2025
The 6.9 kernel introduced a large patchset [1] designed to make gpiochip
usage safe in the presence of potential hotplugging events. The changes
included protecting every gpiochip access with a claim of an interlock.

Running on a Pi 5 these changes reduce GPIO performance from userspace
by around 10%. The penalty would be proportionally higher from kernel,
as seen by SPI speed reductions.

Patch the gpiolib implementation to remove the protection of gpiochip
accesses. By providing alternative implementations of the relevant
macros, the changes are localised and therefore easier to verify.

See: raspberrypi#6854

[1] https://lwn.net/Articles/960024/

Signed-off-by: Phil Elwell <[email protected]>
@pelwell pelwell reopened this May 19, 2025
@pelwell
Copy link
Contributor

pelwell commented May 19, 2025

#6862 is a trial patch to reduce (remove?) the overhead of the GPIO chip safety measures. In about 40 minutes you'll be able to install a trial build of it by running sudo rpi-update pulls/6862. Note that this is very beta software, and any important data should be backed up before installing it.

@Roland-Smith-Imperial (Author)

#6862 is a trial patch to reduce (remove?) the overhead of the GPIO chip safety measures.

Many thanks, that's very helpful, happy to give some beta software a go, warnings on data backup noted :).

pelwell added a commit that referenced this issue May 21, 2025

5 participants