
Replace loader handles with field at start of handle data #2622


Closed
wants to merge 16 commits into from

Conversation

RossBrunton
Contributor

Currently this only works for L0 (v1) and HIP.

@github-actions github-actions bot added loader Loader related feature/bug level-zero L0 adapter specific issues hip HIP adapter specific issues command-buffer Command Buffer feature addition/changes/specification labels Jan 27, 2025
@github-actions github-actions bot added cuda CUDA adapter specific issues native-cpu Native CPU adapter specific issues labels Jan 27, 2025
Contributor

Compute Benchmarks level_zero run (with params: ):
https://github.com/oneapi-src/unified-runtime/actions/runs/12995136466

@github-actions github-actions bot added the common Changes or additions to common utilities label Jan 27, 2025
Contributor

Compute Benchmarks level_zero run ():
https://github.com/oneapi-src/unified-runtime/actions/runs/12995136466
Job status: success. Test status: success.

Summary

Total: 38 benchmarks included in the mean.
Geomean: 100.206%.
Improved: 6, Regressed: 7 (threshold 2.00%)

(relative perf above 100% means this PR is better)

Performance change in benchmark groups

Relative perf in group memory (4): 100.808%
Benchmark This PR baseline Relative perf Change -
memory_benchmark_sycl QueueMemcpy from Device to Device, size 1024 5.825000 μs 5.932 μs 101.84% 1.84% .
memory_benchmark_sycl QueueInOrderMemcpy from Device to Device, size 1024 255.579000 μs 256.472 μs 100.35% 0.35% .
memory_benchmark_sycl QueueInOrderMemcpy from Host to Device, size 1024 218.665000 μs 219.201 μs 100.25% 0.25% .
memory_benchmark_sycl StreamMemory, placement Device, type Triad, size 10240 - 3.074000 GB/s
Relative perf in group api (12): 101.685%
Benchmark This PR baseline Relative perf Change -
api_overhead_benchmark_sycl ExecImmediateCopyQueue out of order from Device to Device, size 1024 2.131000 μs 2.175 μs 102.06% 2.06% ++
api_overhead_benchmark_sycl ExecImmediateCopyQueue in order from Device to Host, size 1024 1.684000 μs 1.706 μs 101.31% 1.31% .
api_overhead_benchmark_l0 SubmitKernel out of order - 11.629000 μs
api_overhead_benchmark_l0 SubmitKernel in order - 11.800000 μs
api_overhead_benchmark_sycl SubmitKernel out of order - 23.287000 μs
api_overhead_benchmark_sycl SubmitKernel in order - 24.664000 μs
api_overhead_benchmark_ur SubmitKernel out of order CPU count - 105463.000000 instr
api_overhead_benchmark_ur SubmitKernel out of order - 16.073000 μs
api_overhead_benchmark_ur SubmitKernel in order CPU count - 110815.000000 instr
api_overhead_benchmark_ur SubmitKernel in order - 16.703000 μs
api_overhead_benchmark_ur SubmitKernel in order with measure completion CPU count - 123991.000000 instr
api_overhead_benchmark_ur SubmitKernel in order with measure completion - 21.473000 μs
Relative perf in group Velocity-Bench (9): 99.170%
Benchmark This PR baseline Relative perf Change -
Velocity-Bench dl-mnist 2.410 s 2.390000 s 99.17% -0.83% .
Velocity-Bench Hashtable - 353.884706 M keys/sec
Velocity-Bench Bitcracker - 35.731600 s
Velocity-Bench CudaSift - 204.632000 ms
Velocity-Bench Easywave - 235.000000 ms
Velocity-Bench QuickSilver - 118.320000 MMS/CTT
Velocity-Bench Sobel Filter - 615.149000 ms
Velocity-Bench dl-cifar - 23.892100 s
Velocity-Bench svm - 0.140700 s
Relative perf in group alloc/size:10000/0/4096/iterations:200000/threads:4 (4): 98.331%
Benchmark This PR baseline Relative perf Change -
alloc/size:10000/0/4096/iterations:200000/threads:4 os_provider 2090.550000 ns 2119.200 ns 101.37% 1.37% .
alloc/size:10000/0/4096/iterations:200000/threads:4 glibc 2695.080000 ns 2723.560 ns 101.06% 1.06% .
alloc/size:10000/0/4096/iterations:200000/threads:4 scalable_pool<os_provider> 305.752 ns 294.824000 ns 96.43% -3.57% ---
alloc/size:10000/0/4096/iterations:200000/threads:4 proxy_pool<os_provider> 3301.340 ns 3124.490000 ns 94.64% -5.36% ----
Relative perf in group alloc/size:10000/0/4096/iterations:200000/threads:1 (4): 99.292%
Benchmark This PR baseline Relative perf Change -
alloc/size:10000/0/4096/iterations:200000/threads:1 os_provider 192.902000 ns 195.800 ns 101.50% 1.50% .
alloc/size:10000/0/4096/iterations:200000/threads:1 scalable_pool<os_provider> 213.998 ns 213.357000 ns 99.70% -0.30% .
alloc/size:10000/0/4096/iterations:200000/threads:1 glibc 708.029 ns 699.961000 ns 98.86% -1.14% .
alloc/size:10000/0/4096/iterations:200000/threads:1 proxy_pool<os_provider> 277.730 ns 269.830000 ns 97.16% -2.84% --
Relative perf in group alloc/size:10000/100000/4096/iterations:200000/threads:4 (4): 102.184%
Benchmark This PR baseline Relative perf Change -
alloc/size:10000/100000/4096/iterations:200000/threads:4 glibc 1234.090000 ns 1399.010 ns 113.36% 13.36% ++++++++++
alloc/size:10000/100000/4096/iterations:200000/threads:4 os_provider 1872.920000 ns 1896.370 ns 101.25% 1.25% .
alloc/size:10000/100000/4096/iterations:200000/threads:4 scalable_pool<os_provider> 261.410 ns 260.987000 ns 99.84% -0.16% .
alloc/size:10000/100000/4096/iterations:200000/threads:4 proxy_pool<os_provider> 3345.870 ns 3183.170000 ns 95.14% -4.86% ----
Relative perf in group alloc/size:10000/100000/4096/iterations:200000/threads:1 (4): 98.976%
Benchmark This PR baseline Relative perf Change -
alloc/size:10000/100000/4096/iterations:200000/threads:1 os_provider 190.629000 ns 192.753 ns 101.11% 1.11% .
alloc/size:10000/100000/4096/iterations:200000/threads:1 scalable_pool<os_provider> 204.159000 ns 204.412 ns 100.12% 0.12% .
alloc/size:10000/100000/4096/iterations:200000/threads:1 glibc 746.724 ns 737.865000 ns 98.81% -1.19% .
alloc/size:10000/100000/4096/iterations:200000/threads:1 proxy_pool<os_provider> 323.599 ns 310.425000 ns 95.93% -4.07% ---
Relative perf in group alloc/min (4): 100.564%
Benchmark This PR baseline Relative perf Change -
alloc/min size:10000/max size:0/granularity:8/65536/8/iterations:200000/threads:4 scalable_pool<os_provider> 1031.030000 ns 1083.760 ns 105.11% 5.11% ++++
alloc/min size:10000/max size:0/granularity:8/65536/8/iterations:200000/threads:1 scalable_pool<os_provider> 963.666 ns 960.784000 ns 99.70% -0.30% .
alloc/min size:10000/max size:0/granularity:8/65536/8/iterations:200000/threads:1 glibc 175.796 ns 174.373000 ns 99.19% -0.81% .
alloc/min size:10000/max size:0/granularity:8/65536/8/iterations:200000/threads:4 glibc 814.890 ns 801.763000 ns 98.39% -1.61% .
Relative perf in group multiple (12): 100.475%
Benchmark This PR baseline Relative perf Change -
multiple_malloc_free/min size:10000/max size:8/granularity:65536/8/iterations:2000/threads:1 scalable_pool<os_provider> 25725.600000 ns 27465.300 ns 106.76% 6.76% +++++
multiple_malloc_free/size:10000/4096/iterations:2000/threads:4 os_provider 1168990.000000 ns 1201570.000 ns 102.79% 2.79% ++
multiple_malloc_free/min size:10000/max size:8/granularity:65536/8/iterations:2000/threads:1 glibc 30426.700000 ns 31243.600 ns 102.68% 2.68% ++
multiple_malloc_free/size:10000/4096/iterations:2000/threads:4 glibc 33868.700000 ns 34482.100 ns 101.81% 1.81% .
multiple_malloc_free/size:10000/4096/iterations:2000/threads:1 scalable_pool<os_provider> 14839.400000 ns 15099.900 ns 101.76% 1.76% .
multiple_malloc_free/size:10000/4096/iterations:2000/threads:4 scalable_pool<os_provider> 42798.400000 ns 43475.800 ns 101.58% 1.58% .
multiple_malloc_free/size:10000/4096/iterations:2000/threads:1 os_provider 148110.000 ns 147271.000000 ns 99.43% -0.57% .
multiple_malloc_free/min size:10000/max size:8/granularity:65536/8/iterations:2000/threads:4 scalable_pool<os_provider> 76090.100 ns 75587.500000 ns 99.34% -0.66% .
multiple_malloc_free/min size:10000/max size:8/granularity:65536/8/iterations:2000/threads:4 glibc 142794.000 ns 141214.000000 ns 98.89% -1.11% .
multiple_malloc_free/size:10000/4096/iterations:2000/threads:1 glibc 4271.520 ns 4207.320000 ns 98.50% -1.50% .
multiple_malloc_free/size:10000/4096/iterations:2000/threads:4 proxy_pool<os_provider> 1218490.000 ns 1185020.000000 ns 97.25% -2.75% --
multiple_malloc_free/size:10000/4096/iterations:2000/threads:1 proxy_pool<os_provider> 168045.000 ns 160292.000000 ns 95.39% -4.61% ---
Relative perf in group miscellaneous (1): cannot calculate
Benchmark This PR baseline Relative perf Change -
miscellaneous_benchmark_sycl VectorSum - 861.253000 bw GB/s
Relative perf in group multithread (10): cannot calculate
Benchmark This PR baseline Relative perf Change -
multithread_benchmark_ur MemcpyExecute opsPerThread:400, numThreads:1, allocSize:102400 srcUSM:1 dstUSM:1 - 6943.025000 μs
multithread_benchmark_ur MemcpyExecute opsPerThread:100, numThreads:8, allocSize:102400 srcUSM:1 dstUSM:1 - 17230.283000 μs
multithread_benchmark_ur MemcpyExecute opsPerThread:400, numThreads:8, allocSize:1024 srcUSM:1 dstUSM:1 - 47306.654000 μs
multithread_benchmark_ur MemcpyExecute opsPerThread:10, numThreads:16, allocSize:1024 srcUSM:1 dstUSM:1 - 2083.870000 μs
multithread_benchmark_ur MemcpyExecute opsPerThread:400, numThreads:1, allocSize:102400 srcUSM:0 dstUSM:1 - 7821.718000 μs
multithread_benchmark_ur MemcpyExecute opsPerThread:100, numThreads:8, allocSize:102400 srcUSM:0 dstUSM:1 - 9073.725000 μs
multithread_benchmark_ur MemcpyExecute opsPerThread:400, numThreads:8, allocSize:1024 srcUSM:0 dstUSM:1 - 26707.698000 μs
multithread_benchmark_ur MemcpyExecute opsPerThread:10, numThreads:16, allocSize:1024 srcUSM:0 dstUSM:1 - 1210.999000 μs
multithread_benchmark_ur MemcpyExecute opsPerThread:4096, numThreads:1, allocSize:1024 srcUSM:0 dstUSM:1 without events - 43064.999000 μs
multithread_benchmark_ur MemcpyExecute opsPerThread:4096, numThreads:4, allocSize:1024 srcUSM:0 dstUSM:1 without events - 114139.645000 μs
Relative perf in group graph (10): cannot calculate
Benchmark This PR baseline Relative perf Change -
graph_api_benchmark_sycl SinKernelGraph graphs:0, numKernels:10 - 71856.495000 μs
graph_api_benchmark_sycl SinKernelGraph graphs:1, numKernels:10 - 72543.241000 μs
graph_api_benchmark_sycl SinKernelGraph graphs:0, numKernels:100 - 353404.211000 μs
graph_api_benchmark_sycl SinKernelGraph graphs:1, numKernels:100 - 353223.514000 μs
graph_api_benchmark_sycl SubmitExecGraph ioq:0, submit:1, numKernels:10 - 54.135000 μs
graph_api_benchmark_sycl SubmitExecGraph ioq:1, submit:1, numKernels:10 - 61.889000 μs
graph_api_benchmark_sycl SubmitExecGraph ioq:1, submit:1, numKernels:100 - 679.477000 μs
graph_api_benchmark_sycl SubmitExecGraph ioq:0, submit:0, numKernels:10 - 5611.771000 μs
graph_api_benchmark_sycl SubmitExecGraph ioq:1, submit:0, numKernels:10 - 5615.778000 μs
graph_api_benchmark_sycl SubmitExecGraph ioq:1, submit:0, numKernels:100 - 57263.652000 μs
Relative perf in group Runtime (8): cannot calculate
Benchmark This PR baseline Relative perf Change -
Runtime_IndependentDAGTaskThroughput_SingleTask - 265.060000 ms
Runtime_IndependentDAGTaskThroughput_BasicParallelFor - 287.518000 ms
Runtime_IndependentDAGTaskThroughput_HierarchicalParallelFor - 277.037000 ms
Runtime_IndependentDAGTaskThroughput_NDRangeParallelFor - 275.224000 ms
Runtime_DAGTaskThroughput_SingleTask - 1678.531000 ms
Runtime_DAGTaskThroughput_BasicParallelFor - 1747.525000 ms
Runtime_DAGTaskThroughput_HierarchicalParallelFor - 1718.971000 ms
Runtime_DAGTaskThroughput_NDRangeParallelFor - 1682.917000 ms
Relative perf in group MicroBench (14): cannot calculate
Benchmark This PR baseline Relative perf Change -
MicroBench_HostDeviceBandwidth_1D_H2D_Contiguous - 4.832000 ms
MicroBench_HostDeviceBandwidth_2D_H2D_Contiguous - 4.730000 ms
MicroBench_HostDeviceBandwidth_3D_H2D_Contiguous - 4.690000 ms
MicroBench_HostDeviceBandwidth_1D_D2H_Contiguous - 4.764000 ms
MicroBench_HostDeviceBandwidth_2D_D2H_Contiguous - 618.120000 ms
MicroBench_HostDeviceBandwidth_3D_D2H_Contiguous - 618.122000 ms
MicroBench_HostDeviceBandwidth_1D_H2D_Strided - 4.700000 ms
MicroBench_HostDeviceBandwidth_2D_H2D_Strided - 5.130000 ms
MicroBench_HostDeviceBandwidth_3D_H2D_Strided - 5.024000 ms
MicroBench_HostDeviceBandwidth_1D_D2H_Strided - 4.854000 ms
MicroBench_HostDeviceBandwidth_2D_D2H_Strided - 617.529000 ms
MicroBench_HostDeviceBandwidth_3D_D2H_Strided - 617.480000 ms
MicroBench_LocalMem_int32_4096 - 29.887000 ms
MicroBench_LocalMem_fp32_4096 - 29.884000 ms
Relative perf in group Pattern (10): cannot calculate
Benchmark This PR baseline Relative perf Change -
Pattern_Reduction_NDRange_int32 - 16.720000 ms
Pattern_Reduction_Hierarchical_int32 - 16.716000 ms
Pattern_SegmentedReduction_NDRange_int16 - 2.266000 ms
Pattern_SegmentedReduction_NDRange_int32 - 2.164000 ms
Pattern_SegmentedReduction_NDRange_int64 - 2.338000 ms
Pattern_SegmentedReduction_NDRange_fp32 - 2.165000 ms
Pattern_SegmentedReduction_Hierarchical_int16 - 11.799000 ms
Pattern_SegmentedReduction_Hierarchical_int32 - 11.588000 ms
Pattern_SegmentedReduction_Hierarchical_int64 - 11.784000 ms
Pattern_SegmentedReduction_Hierarchical_fp32 - 11.585000 ms
Relative perf in group ScalarProduct (6): cannot calculate
Benchmark This PR baseline Relative perf Change -
ScalarProduct_NDRange_int32 - 3.769000 ms
ScalarProduct_NDRange_int64 - 5.461000 ms
ScalarProduct_NDRange_fp32 - 3.773000 ms
ScalarProduct_Hierarchical_int32 - 10.533000 ms
ScalarProduct_Hierarchical_int64 - 11.502000 ms
ScalarProduct_Hierarchical_fp32 - 10.158000 ms
Relative perf in group USM (7): cannot calculate
Benchmark This PR baseline Relative perf Change -
USM_Allocation_latency_fp32_device - 0.067000 ms
USM_Allocation_latency_fp32_host - 37.342000 ms
USM_Allocation_latency_fp32_shared - 0.057000 ms
USM_Instr_Mix_fp32_device_1:1mix_with_init_no_prefetch - 1.684000 ms
USM_Instr_Mix_fp32_host_1:1mix_with_init_no_prefetch - 1.074000 ms
USM_Instr_Mix_fp32_device_1:1mix_no_init_no_prefetch - 1.850000 ms
USM_Instr_Mix_fp32_host_1:1mix_no_init_no_prefetch - 1.256000 ms
Relative perf in group VectorAddition (3): cannot calculate
Benchmark This PR baseline Relative perf Change -
VectorAddition_int32 - 1.475000 ms
VectorAddition_int64 - 3.061000 ms
VectorAddition_fp32 - 1.468000 ms
Relative perf in group Polybench (3): cannot calculate
Benchmark This PR baseline Relative perf Change -
Polybench_2mm - 1.227000 ms
Polybench_3mm - 1.729000 ms
Polybench_Atax - 6.885000 ms
Relative perf in group Kmeans (1): cannot calculate
Benchmark This PR baseline Relative perf Change -
Kmeans_fp32 - 16.080000 ms
Relative perf in group LinearRegressionCoeff (1): cannot calculate
Benchmark This PR baseline Relative perf Change -
LinearRegressionCoeff_fp32 - 935.779000 ms
Relative perf in group MolecularDynamics (1): cannot calculate
Benchmark This PR baseline Relative perf Change -
MolecularDynamics - 0.029000 ms
Relative perf in group llama.cpp (6): cannot calculate
Benchmark This PR baseline Relative perf Change -
llama.cpp Prompt Processing Batched 128 - 829.272674 token/s
llama.cpp Text Generation Batched 128 - 62.469368 token/s
llama.cpp Prompt Processing Batched 256 - 867.896489 token/s
llama.cpp Text Generation Batched 256 - 62.451865 token/s
llama.cpp Prompt Processing Batched 512 - 428.586901 token/s
llama.cpp Text Generation Batched 512 - 62.506870 token/s

Details

Benchmark details - environment, command...
memory_benchmark_sycl QueueInOrderMemcpy from Device to Device, size 1024

Environment Variables:

Command:

/home/pmdk/bench_workdir/compute-benchmarks-build/bin/memory_benchmark_sycl --test=QueueInOrderMemcpy --csv --noHeaders --iterations=10000 --IsCopyOnly=0 --sourcePlacement=Device --destinationPlacement=Device --size=1024 --count=100

memory_benchmark_sycl QueueInOrderMemcpy from Host to Device, size 1024

Environment Variables:

Command:

/home/pmdk/bench_workdir/compute-benchmarks-build/bin/memory_benchmark_sycl --test=QueueInOrderMemcpy --csv --noHeaders --iterations=10000 --IsCopyOnly=0 --sourcePlacement=Host --destinationPlacement=Device --size=1024 --count=100

memory_benchmark_sycl QueueMemcpy from Device to Device, size 1024

Environment Variables:

Command:

/home/pmdk/bench_workdir/compute-benchmarks-build/bin/memory_benchmark_sycl --test=QueueMemcpy --csv --noHeaders --iterations=10000 --sourcePlacement=Device --destinationPlacement=Device --size=1024

api_overhead_benchmark_sycl ExecImmediateCopyQueue out of order from Device to Device, size 1024

Environment Variables:

Command:

/home/pmdk/bench_workdir/compute-benchmarks-build/bin/api_overhead_benchmark_sycl --test=ExecImmediateCopyQueue --csv --noHeaders --iterations=100000 --ioq=0 --IsCopyOnly=1 --MeasureCompletionTime=0 --src=Device --dst=Device --size=1024

api_overhead_benchmark_sycl ExecImmediateCopyQueue in order from Device to Host, size 1024

Environment Variables:

Command:

/home/pmdk/bench_workdir/compute-benchmarks-build/bin/api_overhead_benchmark_sycl --test=ExecImmediateCopyQueue --csv --noHeaders --iterations=100000 --ioq=1 --IsCopyOnly=1 --MeasureCompletionTime=0 --src=Host --dst=Host --size=1024

Velocity-Bench dl-mnist

Environment Variables:

NEOReadDebugKeys=1
DisableScratchPages=0

Command:

/home/pmdk/bench_workdir/dl-mnist/dl-mnist-sycl -conv_algo ONEDNN_AUTO

alloc/size:10000/0/4096/iterations:200000/threads:4 glibc

Environment Variables:

Command:

/home/pmdk/ur-actions-runner/_work/unified-runtime/unified-runtime/umf_build/benchmark/umf-benchmark --benchmark_format=csv

alloc/size:10000/0/4096/iterations:200000/threads:1 glibc

Environment Variables:

Command:

/home/pmdk/ur-actions-runner/_work/unified-runtime/unified-runtime/umf_build/benchmark/umf-benchmark --benchmark_format=csv

alloc/size:10000/100000/4096/iterations:200000/threads:4 glibc

Environment Variables:

Command:

/home/pmdk/ur-actions-runner/_work/unified-runtime/unified-runtime/umf_build/benchmark/umf-benchmark --benchmark_format=csv

alloc/size:10000/100000/4096/iterations:200000/threads:1 glibc

Environment Variables:

Command:

/home/pmdk/ur-actions-runner/_work/unified-runtime/unified-runtime/umf_build/benchmark/umf-benchmark --benchmark_format=csv

alloc/min size:10000/max size:0/granularity:8/65536/8/iterations:200000/threads:4 glibc

Environment Variables:

Command:

/home/pmdk/ur-actions-runner/_work/unified-runtime/unified-runtime/umf_build/benchmark/umf-benchmark --benchmark_format=csv

alloc/min size:10000/max size:0/granularity:8/65536/8/iterations:200000/threads:1 glibc

Environment Variables:

Command:

/home/pmdk/ur-actions-runner/_work/unified-runtime/unified-runtime/umf_build/benchmark/umf-benchmark --benchmark_format=csv

alloc/size:10000/0/4096/iterations:200000/threads:4 os_provider

Environment Variables:

Command:

/home/pmdk/ur-actions-runner/_work/unified-runtime/unified-runtime/umf_build/benchmark/umf-benchmark --benchmark_format=csv

alloc/size:10000/0/4096/iterations:200000/threads:1 os_provider

Environment Variables:

Command:

/home/pmdk/ur-actions-runner/_work/unified-runtime/unified-runtime/umf_build/benchmark/umf-benchmark --benchmark_format=csv

alloc/size:10000/100000/4096/iterations:200000/threads:4 os_provider

Environment Variables:

Command:

/home/pmdk/ur-actions-runner/_work/unified-runtime/unified-runtime/umf_build/benchmark/umf-benchmark --benchmark_format=csv

alloc/size:10000/100000/4096/iterations:200000/threads:1 os_provider

Environment Variables:

Command:

/home/pmdk/ur-actions-runner/_work/unified-runtime/unified-runtime/umf_build/benchmark/umf-benchmark --benchmark_format=csv

alloc/size:10000/0/4096/iterations:200000/threads:4 proxy_pool

Environment Variables:

Command:

/home/pmdk/ur-actions-runner/_work/unified-runtime/unified-runtime/umf_build/benchmark/umf-benchmark --benchmark_format=csv

alloc/size:10000/0/4096/iterations:200000/threads:1 proxy_pool

Environment Variables:

Command:

/home/pmdk/ur-actions-runner/_work/unified-runtime/unified-runtime/umf_build/benchmark/umf-benchmark --benchmark_format=csv

alloc/size:10000/100000/4096/iterations:200000/threads:4 proxy_pool

Environment Variables:

Command:

/home/pmdk/ur-actions-runner/_work/unified-runtime/unified-runtime/umf_build/benchmark/umf-benchmark --benchmark_format=csv

alloc/size:10000/100000/4096/iterations:200000/threads:1 proxy_pool

Environment Variables:

Command:

/home/pmdk/ur-actions-runner/_work/unified-runtime/unified-runtime/umf_build/benchmark/umf-benchmark --benchmark_format=csv

alloc/size:10000/0/4096/iterations:200000/threads:4 scalable_pool

Environment Variables:

Command:

/home/pmdk/ur-actions-runner/_work/unified-runtime/unified-runtime/umf_build/benchmark/umf-benchmark --benchmark_format=csv

alloc/size:10000/0/4096/iterations:200000/threads:1 scalable_pool

Environment Variables:

Command:

/home/pmdk/ur-actions-runner/_work/unified-runtime/unified-runtime/umf_build/benchmark/umf-benchmark --benchmark_format=csv

alloc/size:10000/100000/4096/iterations:200000/threads:4 scalable_pool

Environment Variables:

Command:

/home/pmdk/ur-actions-runner/_work/unified-runtime/unified-runtime/umf_build/benchmark/umf-benchmark --benchmark_format=csv

alloc/size:10000/100000/4096/iterations:200000/threads:1 scalable_pool

Environment Variables:

Command:

/home/pmdk/ur-actions-runner/_work/unified-runtime/unified-runtime/umf_build/benchmark/umf-benchmark --benchmark_format=csv

alloc/min size:10000/max size:0/granularity:8/65536/8/iterations:200000/threads:4 scalable_pool

Environment Variables:

Command:

/home/pmdk/ur-actions-runner/_work/unified-runtime/unified-runtime/umf_build/benchmark/umf-benchmark --benchmark_format=csv

alloc/min size:10000/max size:0/granularity:8/65536/8/iterations:200000/threads:1 scalable_pool

Environment Variables:

Command:

/home/pmdk/ur-actions-runner/_work/unified-runtime/unified-runtime/umf_build/benchmark/umf-benchmark --benchmark_format=csv

multiple_malloc_free/size:10000/4096/iterations:2000/threads:4 glibc

Environment Variables:

Command:

/home/pmdk/ur-actions-runner/_work/unified-runtime/unified-runtime/umf_build/benchmark/umf-benchmark --benchmark_format=csv

multiple_malloc_free/size:10000/4096/iterations:2000/threads:1 glibc

Environment Variables:

Command:

/home/pmdk/ur-actions-runner/_work/unified-runtime/unified-runtime/umf_build/benchmark/umf-benchmark --benchmark_format=csv

multiple_malloc_free/min size:10000/max size:8/granularity:65536/8/iterations:2000/threads:4 glibc

Environment Variables:

Command:

/home/pmdk/ur-actions-runner/_work/unified-runtime/unified-runtime/umf_build/benchmark/umf-benchmark --benchmark_format=csv

multiple_malloc_free/min size:10000/max size:8/granularity:65536/8/iterations:2000/threads:1 glibc

Environment Variables:

Command:

/home/pmdk/ur-actions-runner/_work/unified-runtime/unified-runtime/umf_build/benchmark/umf-benchmark --benchmark_format=csv

multiple_malloc_free/size:10000/4096/iterations:2000/threads:4 proxy_pool

Environment Variables:

Command:

/home/pmdk/ur-actions-runner/_work/unified-runtime/unified-runtime/umf_build/benchmark/umf-benchmark --benchmark_format=csv

multiple_malloc_free/size:10000/4096/iterations:2000/threads:1 proxy_pool

Environment Variables:

Command:

/home/pmdk/ur-actions-runner/_work/unified-runtime/unified-runtime/umf_build/benchmark/umf-benchmark --benchmark_format=csv

multiple_malloc_free/size:10000/4096/iterations:2000/threads:4 os_provider

Environment Variables:

Command:

/home/pmdk/ur-actions-runner/_work/unified-runtime/unified-runtime/umf_build/benchmark/umf-benchmark --benchmark_format=csv

multiple_malloc_free/size:10000/4096/iterations:2000/threads:1 os_provider

Environment Variables:

Command:

/home/pmdk/ur-actions-runner/_work/unified-runtime/unified-runtime/umf_build/benchmark/umf-benchmark --benchmark_format=csv

multiple_malloc_free/size:10000/4096/iterations:2000/threads:4 scalable_pool

Environment Variables:

Command:

/home/pmdk/ur-actions-runner/_work/unified-runtime/unified-runtime/umf_build/benchmark/umf-benchmark --benchmark_format=csv

multiple_malloc_free/size:10000/4096/iterations:2000/threads:1 scalable_pool

Environment Variables:

Command:

/home/pmdk/ur-actions-runner/_work/unified-runtime/unified-runtime/umf_build/benchmark/umf-benchmark --benchmark_format=csv

multiple_malloc_free/min size:10000/max size:8/granularity:65536/8/iterations:2000/threads:4 scalable_pool

Environment Variables:

Command:

/home/pmdk/ur-actions-runner/_work/unified-runtime/unified-runtime/umf_build/benchmark/umf-benchmark --benchmark_format=csv

multiple_malloc_free/min size:10000/max size:8/granularity:65536/8/iterations:2000/threads:1 scalable_pool

Environment Variables:

Command:

/home/pmdk/ur-actions-runner/_work/unified-runtime/unified-runtime/umf_build/benchmark/umf-benchmark --benchmark_format=csv

@RossBrunton RossBrunton force-pushed the ross/nohandle branch 6 times, most recently from a5e38c1 to 3d54672 Compare January 30, 2025 14:20
@github-actions github-actions bot added the opencl OpenCL adapter specific issues label Jan 30, 2025
@RossBrunton RossBrunton force-pushed the ross/nohandle branch 6 times, most recently from 3bfdda6 to 3c26247 Compare January 31, 2025 12:36
@RossBrunton RossBrunton force-pushed the ross/nohandle branch 2 times, most recently from 6387d75 to a8af4b1 Compare February 12, 2025 17:14
@martygrant
Contributor

Unified Runtime -> intel/llvm Repo Move Notice

Information

The source code of Unified Runtime has been moved to intel/llvm under the unified-runtime top-level directory;
all future development will now be carried out there. This was done in intel/llvm#17043.

The code will be mirrored to oneapi-src/unified-runtime and the specification will continue to be hosted at oneapi-src.github.io/unified-runtime.

The contribution guide has been updated with new instructions for contributing to Unified Runtime.

PR Migration

All open PRs including this one will be labelled auto-close and shall be automatically closed after 30 days.
To allow for some breathing space, this automation will not be enabled until next week (27/02/2025).

Should you wish to continue with your PR you will need to migrate it to intel/llvm.
We have provided a script to help automate this process.


This is an automated comment.

We were reading the kernel arguments at kernel execution time, but
kernel arguments are allowed to change between enqueuing and executing.
Make sure to create a copy of kernel arguments ahead of time.

This was previously approved as a unified-runtime PR:
oneapi-src#2700
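
A minimal sketch of the idea follows; the struct and member names are hypothetical, not the adapter's actual types. The point is that argument bytes are copied when the command is recorded, so later `urKernelSetArgValue` calls cannot retroactively change an already-enqueued command.

```cpp
// Hypothetical sketch: snapshot kernel argument values at enqueue time.
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

struct CapturedArg {
  uint32_t index;          // argument slot
  std::vector<char> value; // byte copy taken when the command was recorded
};

struct RecordedKernelCommand {
  std::vector<CapturedArg> args;

  // Called while enqueuing, before the caller can mutate the kernel again.
  void captureArg(uint32_t index, const void *data, size_t size) {
    CapturedArg arg{index, std::vector<char>(size)};
    std::memcpy(arg.value.data(), data, size);
    args.push_back(std::move(arg));
  }
};
```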
zhaomaosu and others added 15 commits February 21, 2025 14:56
Update UMF to the commit:
```
    commit 5a515c56c92be75944c8246535c408cee7711114
    Author: Lukasz Dorau <[email protected]>
    Date:   Mon Feb 17 10:56:05 2025 +0100
    Merge pull request oneapi-src#1086 from vinser52/svinogra_l0_linking
```
to fix the issue in LLVM (SYCL/CUDA):

    intel/llvm#16944
    [SYCL][CUDA] Nsys profiling broken after memory providers change

Moved from: oneapi-src#2708

Fixes: intel/llvm#16944

Signed-off-by: Lukasz Dorau <[email protected]>
Implements calls shared between the command buffer and the queue in the
unified-runtime Level Zero v2 adapter and moves the shared code to
`command_list_manager.cpp`.
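
A rough sketch of the sharing pattern, with illustrative names only (the real helper lives in `command_list_manager.cpp`): both the immediate queue path and the command-buffer path delegate to one append routine.

```cpp
// Illustrative only: both execution paths reuse a single append helper.
#include <cstddef>

struct command_list_manager {
  void *zeCommandList = nullptr; // ze_command_list_handle_t in the real adapter

  // One implementation of "append a memory copy", shared by both callers.
  int appendMemoryCopy(void *dst, const void *src, size_t size) {
    // The real adapter would record zeCommandListAppendMemoryCopy here; this
    // stub only marks where the shared logic lives.
    (void)dst;
    (void)src;
    (void)size;
    return 0;
  }
};

// Queue path: append to the queue's own command list.
int enqueueUSMMemcpy(command_list_manager &queueList, void *dst,
                     const void *src, size_t size) {
  return queueList.appendMemoryCopy(dst, src, size);
}

// Command-buffer path: append to the command buffer's command list.
int commandBufferAppendUSMMemcpy(command_list_manager &cmdBufList, void *dst,
                                 const void *src, size_t size) {
  return cmdBufList.appendMemoryCopy(dst, src, size);
}
```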
As discussed in
oneapi-src#2670 (comment),
the `pCommandBufferDesc` parameter to `urCommandBufferCreateExp` is
optional. However, the UR spec doesn't state what configuration the
created command-buffer has when the descriptor isn't passed, and being
optional is also inconsistent with the descriptor parameters to
urSamplerCreate and urMemImageCreate, which are not optional. This PR
makes the descriptor parameter to command-buffer creation mandatory to
address these concerns.

Closes oneapi-src#2673

**Note**: This UR patch was previously approved and ready-to-merge in
oneapi-src#2676 prior to the
repo move
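
For illustration, a minimal sketch of a call with the descriptor now required; the exact structure-type enum and descriptor fields beyond `stype`/`pNext` are assumptions about the experimental header, not a definitive listing.

```cpp
// Sketch: the descriptor can no longer be nullptr.
#include <ur_api.h>

ur_exp_command_buffer_handle_t createCmdBuf(ur_context_handle_t ctx,
                                            ur_device_handle_t dev) {
  ur_exp_command_buffer_desc_t desc{};
  desc.stype = UR_STRUCTURE_TYPE_EXP_COMMAND_BUFFER_DESC;
  desc.pNext = nullptr;
  desc.isUpdatable = false; // assumed field name: request a non-updatable buffer

  ur_exp_command_buffer_handle_t cmdBuf = nullptr;
  // Passing &desc is now mandatory; nullptr was previously accepted.
  urCommandBufferCreateExp(ctx, dev, &desc, &cmdBuf);
  return cmdBuf;
}
```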
After the [spec bump of cl_khr_command_buffer to
0.9.7](https://github.com/KhronosGroup/OpenCL-Docs/), the OpenCL
adapter no longer needs to worry about the in-order/out-of-order
property of the internal queue used at command-buffer creation
matching the queue used to enqueue the command-buffer.

We can therefore take advantage of the in-order flag passed on UR
command-buffer creation to use an in-order queue for command-buffer
creation, and omit using sync points.

**Note:** This UR patch was previously approved and ready-to-merge prior
to the UR repo move in
oneapi-src#2681
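
A rough sketch of what this enables, under the assumption that the adapter creates its internal queue roughly like this: when the UR command-buffer is created in-order, the internal OpenCL queue can simply omit the out-of-order property, and no sync points need to be emitted between commands.

```cpp
// Sketch: choose the internal queue's ordering from the UR in-order flag.
#include <CL/cl.h>

cl_command_queue makeInternalQueue(cl_context ctx, cl_device_id dev,
                                   bool urBufferIsInOrder) {
  const cl_queue_properties props[] = {
      CL_QUEUE_PROPERTIES,
      // In-order UR command-buffer -> plain in-order OpenCL queue.
      urBufferIsInOrder ? 0 : CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE,
      0};
  cl_int err = CL_SUCCESS;
  return clCreateCommandQueueWithProperties(ctx, dev, props, &err);
}
```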
Fixes #16677 by setting the `-pie` linker option only in Release builds and
only on executables, rather than on any type of target.
…free related error (#16706)

UR: oneapi-src#2592

---------

Co-authored-by: Kenneth Benzie (Benie) <[email protected]>
… and improve its conformance test (#17067)

Migrated from oneapi-src#2533

This patch makes all of the values returned by urEventGetProfilingInfo
optional and updates adapters to handle this by returning the
appropriate enum when a value is not supported.

The tests have also been updated to ensure that returning a counter of
"0" or values equal to the previous profiling event is no longer
considered a failure.
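
As a hedged illustration of "returning the appropriate enum", an adapter-side handler might look like the following; which error code each adapter actually returns for an untracked counter is an assumption here.

```cpp
// Sketch of an adapter reporting an unsupported profiling counter.
#include <ur_api.h>

ur_result_t getProfilingInfo(ur_event_handle_t /*event*/,
                             ur_profiling_info_t propName,
                             size_t /*propSize*/, void * /*pPropValue*/,
                             size_t * /*pPropSizeRet*/) {
  switch (propName) {
  case UR_PROFILING_INFO_COMMAND_START:
  case UR_PROFILING_INFO_COMMAND_END:
    // ... fill in real timestamps for counters the backend tracks ...
    return UR_RESULT_SUCCESS;
  default:
    // Counter not available on this backend: report it as unsupported
    // instead of returning a made-up value.
    return UR_RESULT_ERROR_UNSUPPORTED_ENUMERATION;
  }
}
```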
- Fix the group count not being recalculated when a user passes only a new
local work size and no new global size (see the sketch after this list)
- Remove CTS test skips for local update on L0
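
The arithmetic behind the first fix is simple; the sketch below restates it with hypothetical names (the update descriptor's actual fields differ): when only a new local size arrives, the group count has to be rederived from the existing global size.

```cpp
// Sketch: recompute the group count whenever a new local size is supplied.
#include <array>
#include <cstddef>
#include <cstdint>

std::array<uint32_t, 3>
recomputeGroupCount(const std::array<uint32_t, 3> &globalSize,
                    const std::array<uint32_t, 3> &newLocalSize) {
  std::array<uint32_t, 3> groupCount{};
  for (size_t i = 0; i < 3; ++i) {
    // Round up so a partial group still covers the tail of the global range.
    groupCount[i] = (globalSize[i] + newLocalSize[i] - 1) / newLocalSize[i];
  }
  return groupCount;
}
```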
MSVC warns about a possible uninitialized variable. This is a false
positive, but explicitly initializing the variable is always harmless, so do this.
This is a first step towards re-enabling the UR performance testing CI. It
introduces the reusable YAML workflow and a way to trigger it manually.

Here's an example how it looks:
pbalcer/llvm#2 (comment)
Use the UMF Proxy pool manager with the UMF CUDA memory provider in UR.

The UMF Proxy pool manager is just a wrapper around the UMF memory provider
(the CUDA memory provider in this case) that also adds tracking of memory
allocations.

Moved from: oneapi-src#2659

Signed-off-by: Lukasz Dorau <[email protected]>
There is always only one adapter object, so there's no point in allocating it
via `new`. This fixes an issue where `urReleaseAdapter` (or any other UR
function) called from an `atexit` handler could run after the adapter had
been deleted.
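
A minimal sketch of the pattern, with illustrative names rather than the adapter's real types: keeping the single adapter object as a static, instead of `new`/`delete` driven by the reference count, means no dangling pointer can be reached from an `atexit` handler.

```cpp
// Sketch: one adapter object with static storage, never heap-allocated.
#include <atomic>

struct ur_adapter_handle_t_ {
  std::atomic<int> refCount{0};
  // ... dispatch tables, platform list, etc. ...
};

ur_adapter_handle_t_ &getAdapter() {
  // Constructed on first use; releasing the last reference no longer
  // destroys it, so late atexit callers still see a valid object.
  static ur_adapter_handle_t_ adapter;
  return adapter;
}
```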
This replaces the handle logic in the loader: instead of wrapped pointers, a
DDI table pointer now sits at the start of the handle struct itself.
@github-actions github-actions bot added conformance Conformance test suite issues. specification Changes or additions to the specification experimental Experimental feature additions/changes/specification labels Feb 21, 2025
kbenzie pushed a commit to intel/llvm that referenced this pull request May 13, 2025
Migrated from oneapi-src/unified-runtime#2622

All handles from all backends are now required to implement `ddi_getter`
and their first field must be a pointer to a `ur_ddi_table_t` (which
also implies that they must not have a vtable).

Instead of wrapping handles in a special wrapper type, we instead query
the DDI table stored in the handle itself. This simplifies the loader
greatly.
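
To make the layout requirement concrete, here is an illustrative sketch (type and member names are assumptions, not the actual headers): each handle begins with a pointer to its adapter's DDI table, and the loader reads that pointer directly instead of wrapping the handle.

```cpp
// Sketch of the handle layout the loader relies on after this change.
struct ur_ddi_table_t; // the adapter's dispatch table, defined elsewhere

struct ur_queue_handle_t_ {
  const ur_ddi_table_t *ddiTable; // must be the first field of every handle;
                                  // the required ddi_getter would return it
  // ... backend-specific state follows (no vtable allowed, so the table
  //     pointer really is at offset zero) ...
};

// What the loader can now do with any handle it is handed:
inline const ur_ddi_table_t *getDdiTable(const void *handle) {
  // Valid because every backend handle starts with the table pointer.
  return *static_cast<const ur_ddi_table_t *const *>(handle);
}
```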
github-actions bot pushed a commit that referenced this pull request May 14, 2025
kbenzie pushed a commit that referenced this pull request May 14, 2025