| 1 | +# FFT benchmarks for Intel(R) Distribution for Python\* |
| 2 | + |
| 3 | +This set of benchmarks measures performance of FFT computations, serving to |
| 4 | +highlight performance improvements to FFT computations in NumPy and SciPy in |
| 5 | +the Intel(R) Distribution for Python\*. We provide both Python and native |
| 6 | +(MKL DFTI) implementations of these benchmarks with similar command-line |
| 7 | +interfaces. |
| 8 | + |
| 9 | +## Python benchmarks |
| 10 | + |
| 11 | +To reproduce, install Intel(R) Distribution for Python\* as follows: |
| 12 | + |
| 13 | +```bash |
| 14 | +conda create -n 'idp3_fft' -c intel numpy scipy |
| 15 | +conda activate idp3_fft |
| 16 | +``` |
| 17 | + |
| 18 | +To benchmark FFT in Python, execute |
| 19 | + |
| 20 | +```bash |
| 21 | +python fft_bench.py [-h] [args] size |
| 22 | +``` |
| 23 | + |
| 24 | +The benchmark first performs one unmeasured warm-up computation, then takes |
| 25 | +24 timings, each covering 16 repetitions of the FFT computation in an inner |
| 26 | +loop. The 24 measurements are aggregated into minimum, median, and maximum |
| 27 | +timings, which are printed to STDOUT. |
| 28 | + |
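The timing scheme described above can be sketched in a few lines of Python. This is a minimal illustration, not the code in `fft_bench.py`: the function name `time_fft` and its parameters are hypothetical, and dividing each measurement by the inner-loop count (so that it is a per-call average) is an assumption.

```python
import statistics
import time


def time_fft(compute, outer_loops=24, inner_loops=16):
    """Time `compute` with one warm-up call, then outer_loops measurements.

    Hypothetical helper: each measurement covers inner_loops calls and is
    reported as the average time per call (an assumption).
    """
    compute()  # unmeasured warm-up computation
    timings = []
    for _ in range(outer_loops):
        start = time.perf_counter()
        for _ in range(inner_loops):
            compute()
        elapsed = time.perf_counter() - start
        timings.append(elapsed / inner_loops)
    # Aggregate the measurements as minimum, median, and maximum.
    return min(timings), statistics.median(timings), max(timings)


# Stand-in workload; a real run would call an FFT routine here instead.
fastest, median, slowest = time_fft(lambda: sum(i * i for i in range(1000)))
print(f"min={fastest:.3e} s  median={median:.3e} s  max={slowest:.3e} s")
```

With the values from the text, the workload runs 1 + 24 × 16 = 385 times in total, of which only the 384 calls inside the loops are timed.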
| 29 | +Other output lines, which start with 'TAG: ', are informational only and can |
| 30 | +be filtered out if needed. |
| 31 | + |
| 32 | +### Examples |
| 33 | + |
| 34 | +Benchmark a 2D out-of-place FFT of a `complex128` array of size `(10000, |
| 35 | +10000)`: |
| 36 | +``` |
| 37 | +python fft_bench.py 10000x10000 |
| 38 | +``` |
| 39 | + |
| 40 | +Benchmark a 1D in-place FFT of a `float32` array of size `100000000`, printing |
| 41 | +only 5 measurements, computing only the first half of the conjugate-even DFT |
| 42 | +coefficients, and allowing the FFT backend to use only one thread: |
| 43 | +``` |
| 44 | +python fft_bench.py -P -r -t 1 -d float32 -o 5 100000000 |
| 45 | +``` |
| 46 | + |
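The conjugate-even property behind the `-r` option can be checked with a small pure-Python sketch. The naive DFT below is for illustration only and is not what the benchmark or MKL executes; the helper names are hypothetical, and keeping `n // 2 + 1` coefficients follows the convention of NumPy's `rfft`, which is assumed to be what "first half" means here.

```python
import cmath


def naive_dft(x):
    """O(n^2) discrete Fourier transform, for illustration only."""
    n = len(x)
    return [
        sum(x[j] * cmath.exp(-2j * cmath.pi * j * k / n) for j in range(n))
        for k in range(n)
    ]


def half_spectrum(x):
    """Keep only the first n // 2 + 1 coefficients (hypothetical helper)."""
    return naive_dft(x)[: len(x) // 2 + 1]


signal = [1.0, 2.0, 0.0, -1.0, 3.0, 0.5]  # real-valued input
n = len(signal)
full = naive_dft(signal)
half = half_spectrum(signal)

# For real input, X[n - k] equals the complex conjugate of X[k], so the
# remaining coefficients can be rebuilt from the stored half.
rebuilt = half + [half[n - k].conjugate() for k in range(n // 2 + 1, n)]
max_error = max(abs(a - b) for a, b in zip(full, rebuilt))
print(len(half), "of", n, "coefficients stored; max rebuild error:", max_error)
```

Because the second half of the spectrum is redundant for real inputs, skipping it saves both computation and the copy of the superfluous harmonics.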
| 47 | +Benchmark a 3D in-place FFT of a `complex64` array of size `1001x203x3005`, |
| 48 | +printing only 5 measurements, each of which averages over 24 inner-loop |
| 49 | +computations: |
| 50 | +``` |
| 51 | +python fft_bench.py -P -d complex64 -o 5 -i 24 1001x203x3005 |
| 52 | +``` |
| 53 | + |
| 54 | +## Native benchmarks |
| 55 | + |
| 56 | +### Compiling on Linux |
| 57 | +- To compile, source the compiler environment and run `make`. |
| 58 | +- Run the benchmark with `./fft_bench`. |
| 59 | + |
| 60 | +### Compiling on Windows |
| 61 | +- Source the compiler and MKL environment scripts, then run `win_compile_all.bat`. |
| 62 | + ``` |
| 63 | + > "C:\Program Files (x86)\IntelSWTools\compilers_and_libraries\windows\bin\compilervars.bat" intel64 |
| 64 | + > "C:\Program Files (x86)\IntelSWTools\compilers_and_libraries\windows\mkl\bin\mklvars.bat" intel64 |
| 65 | + > win_compile_all.bat |
| 66 | + ``` |
| 67 | +- To run the benchmark, execute `fft_bench.exe`. Note that long options are not |
| 68 | + supported on Windows; use short options instead. |
| 69 | + |
| 70 | +### Examples |
| 71 | + |
| 72 | +Benchmark a 2D out-of-place FFT of a `complex128` array of size `(10000, |
| 73 | +10000)`: |
| 74 | +``` |
| 75 | +./fft_bench 10000x10000 |
| 76 | +``` |
| 77 | + |
| 78 | +Benchmark a 1D in-place FFT of a `float32` array of size `100000000`, printing |
| 79 | +only 5 measurements, computing only the first half of the conjugate-even DFT |
| 80 | +coefficients, and allowing the FFT backend to use only one thread: |
| 81 | +``` |
| 82 | +./fft_bench -P -r -t 1 -d float32 -o 5 100000000 |
| 83 | +``` |
| 84 | + |
| 85 | +Benchmark a 3D in-place FFT of a `complex64` array of size `1001x203x3005`, |
| 86 | +printing only 5 measurements, each of which averages over 24 inner-loop |
| 87 | +computations: |
| 88 | +``` |
| 89 | +./fft_bench -P -d complex64 -o 5 -i 24 1001x203x3005 |
| 90 | +``` |
| 91 | + |
| 92 | +### Usage |
| 93 | + |
| 94 | +``` |
| 95 | +usage: ./fft_bench [args] size |
| 96 | +Benchmark FFT using Intel(R) MKL DFTI. |
| 97 | + |
| 98 | +FFT problem arguments: |
| 99 | + -t, --threads=THREADS use THREADS threads for FFT execution |
| 100 | + (default: use MKL's default) |
| 101 | + -d, --dtype=DTYPE use DTYPE as the FFT domain. For a list of |
| 102 | + understood dtypes, use '-d help'. |
| 103 | + (default: complex128) |
| 104 | + -r, --rfft do not copy superfluous harmonics when FFT |
| 105 | + output is even-conjugate, i.e. for real inputs |
| 106 | + -P, --in-place allow overwriting the input buffer with the |
| 107 | + FFT outputs |
| 108 | + -c, --cached use the same DFTI descriptor for the same |
| 109 | + outer loop, i.e. "cache" the descriptor |
| 110 | + |
| 111 | +Timing arguments: |
| 112 | + -i, --inner-loops=IL time the benchmark IL times for each printed |
| 113 | + measurement. Copies are not included in the |
| 114 | + measurements. (default: 16) |
| 115 | + -o, --outer-loops=OL print OL measurements. (default: 5) |
| 116 | +
|
| 117 | +Output arguments: |
| 118 | + -p, --prefix=PREFIX output PREFIX as the first value in outputs |
| 119 | + (default: 'Native-C') |
| 120 | + -H, --no-header do not output CSV header. This can be useful |
| 121 | + if running multiple benchmarks back-to-back. |
| 122 | + -h, --help print this message and exit |
| 123 | + |
| 124 | +The size argument specifies the input matrix size as a tuple of positive |
| 125 | +decimal integers, delimited by any non-digit. For example, both |
| 126 | +(101, 203, 305) and 101x203x305 denote the same 3D FFT. |
| 127 | +``` |
| 128 | + |
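The size rule in the usage text (positive decimal integers delimited by any non-digit) can be sketched as follows. The function `parse_size` is a hypothetical illustration of that rule, not the parser used by either benchmark.

```python
import re


def parse_size(text):
    """Parse a size string such as '101x203x305' into a tuple of ints.

    Hypothetical helper: every run of decimal digits is one dimension, and
    any non-digit characters act as delimiters.
    """
    dims = tuple(int(token) for token in re.findall(r"\d+", text))
    if not dims:
        raise ValueError(f"no dimensions found in {text!r}")
    if any(d <= 0 for d in dims):
        raise ValueError(f"dimensions must be positive: {text!r}")
    return dims


# Both spellings from the usage text describe the same 3D problem.
print(parse_size("(101, 203, 305)"))
print(parse_size("101x203x305"))
```

Treating every non-digit as a delimiter keeps the parser tolerant of shells that make parentheses or commas awkward to type, which is presumably why the `x`-separated form is used in the examples.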
| 129 | +## See also |
| 130 | +"[Accelerating Scientific Python with Intel |
| 131 | +Optimizations](http://conference.scipy.org/proceedings/scipy2017/pdfs/oleksandr_pavlyk.pdf)" |
| 132 | +by Oleksandr Pavlyk, Denis Nagorny, Andres Guzman-Ballen, Anton Malakhov, Hai |
| 133 | +Liu, Ehsan Totoni, Todd A. Anderson, Sergey Maidanov. Proceedings of the 16th |
| 134 | +Python in Science Conference (SciPy 2017), July 10 - July 16, Austin, Texas |