
Commit 2307523

0cc4m, SlyEcho, LostRuins, slaren, and ggerganov authored
ggml : add Vulkan backend (ggml-org#2059)
* Vulkan loader code
* Fix matmul kernel, continue implementation
* Continue implementation
* Vulkan memory management
* Vulkan development
* Matmul call
* Add aligned malloc and free for VMA
* Continue implementation
* First matmul success
* GEMM Kernel optimization
* 1D Blocktiling
* 2D Blocktiling
* Write coalescing
* Continue vulkan implementation and optimization
* First FP16 attempt, disabled for now
* Code abstraction, FP16 implementation, fix kernel, add FP16 to FP32 kernel
* Enable device extensions properly, restore fp16 matmul op
* Fix mulmat_f16
* Output FP32 in fp16 matmul shader
* Fix f16_to_f32 kernel
* dequant_q4_0 kernel
* Add VMA library
* Avoid requesting dedicated memory, VMA can decide that by itself
* Add bounds checking to matmul kernels, improve implementation, fix command buffers not freed properly
* add cmake commands
* Add 2d write operation, profiling code
* Fix 2d write
* Fix queue selection for AMD RADV
* Fix trailing whitespace in vk_mem_alloc.h
* Add WIP warp tile mat mul shaders
* Disable glslc optimization
* Disable glslc optimization for CMake
* Optimize warptile matmul shader, replace blocktile with it
* Add split-k optimization for small matrix multiplication (illustrated in the plain-C sketch after the file statistics below)
  Use semaphores for synchronization instead of fences or waitidle
  Rework async write/read for synchronization
* Fix validation errors, improve compatibility with AMD GPUs
* Rework command buffer handling
* Variable matmul kernel using specialization constants
* Fix synchronization on AMD, add barriers for buffer ownership transfer, add debug flag and prints
* Reuse semaphores
* Handle stage flags during command buffer submission properly
* Increase matmul test runs for consistent results
* Fix F32 matmul
* Add vectorized loading and zeropadding for matrix multiplication
* Use pinned memory for f16 preprocessing
* Don't force aligned matmul
* Don't free before queue done
* Replace VMA library with native Vulkan buffer management
* Basic offloading support with mul_f32 and dmmv for q4_0
* Run glslc commands in parallel
* Unroll loops in dmmv shader
* Reduce usage of waitIdle
* Reuse pinned allocation for f16 conversion
* Handle devices with only a single queue
* Fix trailing whitespace in CMakeLists.txt
* Allow parallel execution of kernels, parallelize third and fourth dimension calls
* Add fallback for devices only supporting one DescriptorSet per DescriptorPool
* Move to graph function similar to CUDA implementation
* Use F16 kernel for most things, replace q_f32 with mul_mat_q_f16 function
* Add F32 dmmv shaders
* Batch submissions
* Add .spv to gitignore
* Split off matrix vector multiplication for separate optimization
* Use single command buffer for matrix vector multiplication ops
* Reduce overhead of mul_f32 calls by using a single command buffer
* Add submission batching to mul_f32
* Fix tests
* Add missing barrier
* Add further missing barrier
* Add further ops
* Replace vk::QueueFamilyIgnored with VK_QUEUE_FAMILY_IGNORED to support more Vulkan header versions
* Remove unnecessary cblas link
* Fix descriptor set pre-allocation assert
* Add runtime shader compilation, start transferring shaders to this approach
* Transfer remaining shaders to header and compile on runtime
* Fix fp32 fallback if device doesn't support fp16, add force disable env var GGML_VULKAN_DISABLE_F16
* Add support for q4_1, q5_0, q5_1 and q8_0
* Remove unnecessary scalar layout extension
* Parse graph early to pre-record command buffers
* Add q6_k support
* Add multi-submit for command buffers
* Fix q6_k dequant shader for AMD
* Fix q6_k for GPUs without fp16 support
* Simplify q6_k fp16 fix
* Minor fixes
* Fix wg_denom of m-mulmat shaders
* Add Python-based Vulkan shader generator
* Replace shaderc dependency with precompiled shaders
  Fix python script to generate shaders
* Clean up code
* Fix shader generator script Windows compatibility
  Co-authored-by: Concedo <[email protected]>
* Close file before deletion
* Fix vulkan shader fp32 name
* Add q2_k and q3_k support
  Add validation check to compare shader results to cpu results
* Add q4_k support
* Add q5_k support
* Bake SPIR-V bytecode into the library instead of loading shaders from file
* Switch to signal semaphores for flexibility
  Prepare broadcasting support for mul mat
* Finish broadcasting mul mat support for GQA
* Clean up unused functions
  Add repeat op
* Add further ops, not yet enabled. Improve semaphore code
* Reduce number of used semaphores by utilizing timelines more properly
* Remove queue information
* Reuse timeline semaphores, allow parallel operation with binary semaphores to work around nvidia driver limitations
* Add Vulkan to llama-bench
* Remove cblas dependency
* Fix matmul k-split bug
* Fix q4_k dmmv K_QUANTS_PER_ITERATION 1 shader
* Add RMS Norm shader, rework op_f32 shader setup, fix matmul bug
* Fix issues with float16 overflows in shaders
* Fix issues with older Vulkan headers on Ubuntu 22.04
* Allow multi-op partial offloading by parsing the graph to preallocate enough between-op buffers
* Implement further ops, rework op_f32 calls, fix bugs
* Finish full offloading support, add last remaining ops, fix bugs, remove redundant code
* Upload generated file ggml-vulkan-shaders.hpp, remove redundant shaders
* Merge upstream changes, fix conflicts, adapt soft_max op
* Fix Python and shader header format
* Free model gpu buffers on exit
* Use single queue per device to simplify code
* Add matmul shader support for running multiple calculations in parallel
* Switch from semaphore-synchronized multiple command buffers per op to single command buffer for multiple ops, whole graph if possible
* Fix missing event cast
* Replace uint64_t(-1) with UINT64_MAX, rename function for clarity
* Fix warning about empty C function parameters
* Fix compiler warnings
* Properly implement Vulkan backend buffer handling
* Fix oversized host staging buffers
* Simplify barrier synchronization calls
* Fix gcc warnings
* Implement max_size for backend buffer types to limit the size of a single allocation
* Use min of maxMemoryAllocationSize and maxBufferSize for device max allocation size
* refactor multi buf
* Disable unsupported ops to fix tests
* Check for maintenance4 support before using it
* Handle devices with only a single queue
* Fix single queue logic
* propagate buffer usage in multi buffers
* Implement rope_neox op
* Cleanup header and other files
* Simplify gpu_extras by removing events and putting staging memcpys into contexts
* Move queue into context
  Add not-yet-enabled async backend ops
* Simplify context use, optimize matmul shader for warp size 64 (AMD GCN), fix split_k matmul shader optimization
* Add get_max_size to SYCL backend.
  Co-authored-by: Georgi Gerganov <[email protected]>
* llama : fix trailing whitespace

---------

Co-authored-by: Henri Vasserman <[email protected]>
Co-authored-by: Concedo <[email protected]>
Co-authored-by: slaren <[email protected]>
Co-authored-by: Georgi Gerganov <[email protected]>
1 parent 0f64857 commit 2307523

19 files changed: +69294 -34 lines

CMakeLists.txt

+17
@@ -99,6 +99,7 @@ set(LLAMA_CUDA_PEER_MAX_BATCH_SIZE "128" CACHE STRING
 option(LLAMA_HIPBLAS "llama: use hipBLAS" OFF)
 option(LLAMA_HIP_UMA "llama: use HIP unified memory architecture" OFF)
 option(LLAMA_CLBLAST "llama: use CLBlast" OFF)
+option(LLAMA_VULKAN "llama: use Vulkan" OFF)
 option(LLAMA_METAL "llama: use Metal" ${LLAMA_METAL_DEFAULT})
 option(LLAMA_METAL_NDEBUG "llama: disable Metal debugging" OFF)
 option(LLAMA_METAL_SHADER_DEBUG "llama: compile Metal with -fno-fast-math" OFF)
@@ -416,6 +417,22 @@ if (LLAMA_CLBLAST)
     endif()
 endif()
 
+if (LLAMA_VULKAN)
+    find_package(Vulkan)
+    if (Vulkan_FOUND)
+        message(STATUS "Vulkan found")
+
+        add_library(ggml-vulkan STATIC ggml-vulkan.cpp ggml-vulkan.h)
+        target_link_libraries(ggml-vulkan PRIVATE Vulkan::Vulkan)
+
+        add_compile_definitions(GGML_USE_VULKAN)
+
+        set(LLAMA_EXTRA_LIBS ${LLAMA_EXTRA_LIBS} ggml-vulkan)
+    else()
+        message(WARNING "Vulkan not found")
+    endif()
+endif()
+
 if (LLAMA_HIPBLAS)
     list(APPEND CMAKE_PREFIX_PATH /opt/rocm)

Makefile

+13
@@ -448,6 +448,19 @@ ggml-opencl.o: ggml-opencl.cpp ggml-opencl.h
 	$(CXX) $(CXXFLAGS) -c $< -o $@
 endif # LLAMA_CLBLAST
 
+ifdef LLAMA_VULKAN
+MK_CPPFLAGS += -DGGML_USE_VULKAN
+MK_LDFLAGS += -lvulkan
+OBJS += ggml-vulkan.o
+
+ifdef LLAMA_VULKAN_CHECK_RESULTS
+MK_CPPFLAGS += -DGGML_VULKAN_CHECK_RESULTS
+endif
+
+ggml-vulkan.o: ggml-vulkan.cpp ggml-vulkan.h
+	$(CXX) $(CXXFLAGS) -c $< -o $@
+endif # LLAMA_VULKAN
+
 ifdef LLAMA_HIPBLAS
 
 ifeq ($(wildcard /opt/rocm),)
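As a hedged sketch of how these build flags are typically consumed in the sources (the guard placement and example_vulkan_paths are illustrative assumptions, not code from this commit): LLAMA_VULKAN defines GGML_USE_VULKAN, which gates the Vulkan code path and the new ggml-vulkan.h header, while LLAMA_VULKAN_CHECK_RESULTS defines GGML_VULKAN_CHECK_RESULTS for the shader-vs-CPU validation mentioned in the commit message.

// Illustrative only, not code from this commit: example_vulkan_paths is hypothetical.
#ifdef GGML_USE_VULKAN
#include "ggml-vulkan.h" // header added by this commit
#endif

void example_vulkan_paths(void) {
#ifdef GGML_USE_VULKAN
    // Vulkan-specific setup would run here when the backend is compiled in
#endif
#ifdef GGML_VULKAN_CHECK_RESULTS
    // debug path: re-run ops on the CPU and compare with the GPU results
#endif
}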

examples/llama-bench/llama-bench.cpp

+8 -3
@@ -562,6 +562,7 @@ struct test {
     static const int build_number;
     static const bool cuda;
     static const bool opencl;
+    static const bool vulkan;
     static const bool metal;
     static const bool gpu_blas;
     static const bool blas;
@@ -643,6 +644,9 @@ struct test {
         if (opencl) {
             return "OpenCL";
         }
+        if (vulkan) {
+            return "Vulkan";
+        }
         if (metal) {
             return "Metal";
         }
@@ -658,7 +662,7 @@ struct test {
     static const std::vector<std::string> & get_fields() {
         static const std::vector<std::string> fields = {
             "build_commit", "build_number",
-            "cuda", "opencl", "metal", "gpu_blas", "blas",
+            "cuda", "opencl", "vulkan", "metal", "gpu_blas", "blas",
             "cpu_info", "gpu_info",
             "model_filename", "model_type", "model_size", "model_n_params",
             "n_batch", "n_threads", "type_k", "type_v",
@@ -682,7 +686,7 @@ struct test {
             field == "avg_ns" || field == "stddev_ns") {
             return INT;
         }
-        if (field == "cuda" || field == "opencl" || field == "metal" || field == "gpu_blas" || field == "blas" ||
+        if (field == "cuda" || field == "opencl" || field == "vulkan"|| field == "metal" || field == "gpu_blas" || field == "blas" ||
             field == "f16_kv" || field == "no_kv_offload" || field == "mul_mat_q") {
             return BOOL;
         }
@@ -710,7 +714,7 @@ struct test {
         }
         std::vector<std::string> values = {
             build_commit, std::to_string(build_number),
-            std::to_string(cuda), std::to_string(opencl), std::to_string(metal), std::to_string(gpu_blas), std::to_string(blas),
+            std::to_string(cuda), std::to_string(opencl), std::to_string(vulkan), std::to_string(metal), std::to_string(gpu_blas), std::to_string(blas),
             cpu_info, gpu_info,
             model_filename, model_type, std::to_string(model_size), std::to_string(model_n_params),
             std::to_string(n_batch), std::to_string(n_threads), ggml_type_name(type_k), ggml_type_name(type_v),
@@ -738,6 +742,7 @@ const std::string test::build_commit = LLAMA_COMMIT;
 const int test::build_number = LLAMA_BUILD_NUMBER;
 const bool test::cuda = !!ggml_cpu_has_cublas();
 const bool test::opencl = !!ggml_cpu_has_clblast();
+const bool test::vulkan = !!ggml_cpu_has_vulkan();
 const bool test::metal = !!ggml_cpu_has_metal();
 const bool test::gpu_blas = !!ggml_cpu_has_gpublas();
 const bool test::blas = !!ggml_cpu_has_blas();
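The new vulkan capability flag is read through ggml_cpu_has_vulkan(), alongside the existing cuda/opencl/metal checks. A minimal sketch (not part of the commit; it assumes ggml.h from a build configured with the options above) prints the same flags llama-bench records:

/* Minimal illustrative program, not from this commit. */
#include <stdio.h>
#include "ggml.h"

int main(void) {
    printf("cuda:   %d\n", ggml_cpu_has_cublas());
    printf("opencl: %d\n", ggml_cpu_has_clblast());
    printf("vulkan: %d\n", ggml_cpu_has_vulkan()); // non-zero only in GGML_USE_VULKAN builds
    printf("metal:  %d\n", ggml_cpu_has_metal());
    return 0;
}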

ggml-alloc.c

+82 -24
@@ -778,38 +778,26 @@ size_t ggml_allocr_alloc_graph(ggml_allocr_t alloc, struct ggml_cgraph * graph)
 }
 
 // utils
-ggml_backend_buffer_t ggml_backend_alloc_ctx_tensors_from_buft(struct ggml_context * ctx, ggml_backend_buffer_type_t buft) {
-    GGML_ASSERT(ggml_get_no_alloc(ctx) == true);
-
-    size_t alignment = ggml_backend_buft_get_alignment(buft);
-
-    size_t nbytes = 0;
-    for (struct ggml_tensor * t = ggml_get_first_tensor(ctx); t != NULL; t = ggml_get_next_tensor(ctx, t)) {
-        if (t->data == NULL && t->view_src == NULL) {
-            nbytes += GGML_PAD(ggml_backend_buft_get_alloc_size(buft, t), alignment);
-        }
-    }
-
-    if (nbytes == 0) {
-        // all the tensors in the context are already allocated
-#ifndef NDEBUG
-        fprintf(stderr, "%s: all tensors in the context are already allocated\n", __func__);
-#endif
-        return NULL;
-    }
 
-    ggml_backend_buffer_t buffer = ggml_backend_buft_alloc_buffer(buft, nbytes);
+static bool alloc_tensor_range(struct ggml_context * ctx,
+                               struct ggml_tensor * first, struct ggml_tensor * last,
+                               ggml_backend_buffer_type_t buft, size_t size,
+                               ggml_backend_buffer_t ** buffers, size_t * n_buffers) {
+    ggml_backend_buffer_t buffer = ggml_backend_buft_alloc_buffer(buft, size);
     if (buffer == NULL) {
-        // failed to allocate buffer
 #ifndef NDEBUG
-        fprintf(stderr, "%s: failed to allocate buffer\n", __func__);
+        fprintf(stderr, "%s: failed to allocate %s buffer of size %zu\n", __func__, ggml_backend_buft_name(buft), size);
 #endif
-        return NULL;
+        for (size_t i = 0; i < *n_buffers; i++) {
+            ggml_backend_buffer_free(*buffers[i]);
+        }
+        free(buffers);
+        return false;
     }
 
     ggml_tallocr_t tallocr = ggml_tallocr_new_from_buffer(buffer);
 
-    for (struct ggml_tensor * t = ggml_get_first_tensor(ctx); t != NULL; t = ggml_get_next_tensor(ctx, t)) {
+    for (struct ggml_tensor * t = first; t != last; t = ggml_get_next_tensor(ctx, t)) {
         if (t->data == NULL) {
             if (t->view_src == NULL) {
                 ggml_tallocr_alloc(tallocr, t);
@@ -826,6 +814,76 @@ ggml_backend_buffer_t ggml_backend_alloc_ctx_tensors_from_buft(struct ggml_conte
 
     ggml_tallocr_free(tallocr);
 
+    *buffers = realloc(*buffers, sizeof(ggml_backend_buffer_t) * (*n_buffers + 1));
+    (*buffers)[(*n_buffers)++] = buffer;
+
+    return true;
+}
+
+ggml_backend_buffer_t ggml_backend_alloc_ctx_tensors_from_buft(struct ggml_context * ctx, ggml_backend_buffer_type_t buft) {
+    GGML_ASSERT(ggml_get_no_alloc(ctx) == true);
+
+    size_t alignment = ggml_backend_buft_get_alignment(buft);
+    size_t max_size = ggml_backend_buft_get_max_size(buft);
+
+    ggml_backend_buffer_t * buffers = NULL;
+    size_t n_buffers = 0;
+
+    size_t cur_buf_size = 0;
+    struct ggml_tensor * first = ggml_get_first_tensor(ctx);
+    for (struct ggml_tensor * t = first; t != NULL; t = ggml_get_next_tensor(ctx, t)) {
+        size_t this_size = 0;
+        if (t->data == NULL && t->view_src == NULL) {
+            this_size = GGML_PAD(ggml_backend_buft_get_alloc_size(buft, t), alignment);
+        }
+
+        if (this_size > max_size) {
+            // tensor is too large to fit in a single buffer
+            fprintf(stderr, "%s: tensor %s is too large to fit in a %s buffer (tensor size: %zu, max buffer size: %zu)\n",
+                    __func__, t->name,
+                    ggml_backend_buft_name(buft),
+                    this_size, max_size);
+            for (size_t i = 0; i < n_buffers; i++) {
+                ggml_backend_buffer_free(buffers[i]);
+            }
+            free(buffers);
+            return NULL;
+        }
+
+        if ((cur_buf_size + this_size) > max_size) {
+            // allocate tensors in the current buffer
+            if (!alloc_tensor_range(ctx, first, t, buft, cur_buf_size, &buffers, &n_buffers)) {
+                return NULL;
+            }
+            first = t;
+            cur_buf_size = this_size;
+        } else {
+            cur_buf_size += this_size;
+        }
+    }
+
+    // allocate remaining tensors
+    if (cur_buf_size > 0) {
+        if (!alloc_tensor_range(ctx, first, NULL, buft, cur_buf_size, &buffers, &n_buffers)) {
+            return NULL;
+        }
+    }
+
+    if (n_buffers == 0) {
+        // all the tensors in the context are already allocated
+#ifndef NDEBUG
+        fprintf(stderr, "%s: all tensors in the context are already allocated\n", __func__);
+#endif
+        return NULL;
+    }
+
+    ggml_backend_buffer_t buffer;
+    if (n_buffers == 1) {
+        buffer = buffers[0];
+    } else {
+        buffer = ggml_backend_multi_buffer_alloc_buffer(buffers, n_buffers);
+    }
+    free(buffers);
     return buffer;
 }
 
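For context, a minimal usage sketch of the reworked allocator (not code from this commit; alloc_example_weights is a hypothetical helper and it assumes the public ggml.h, ggml-alloc.h and ggml-backend.h headers from this tree): the caller still makes a single call, and when the context's total size exceeds the buffer type's max_size the tensors are now split transparently across several buffers wrapped in a multi buffer.

/* Minimal illustrative usage sketch, not from this commit. */
#include "ggml.h"
#include "ggml-alloc.h"
#include "ggml-backend.h"

static ggml_backend_buffer_t alloc_example_weights(struct ggml_context ** out_ctx) {
    struct ggml_init_params params = {
        /*.mem_size   =*/ 16*1024*1024, // metadata only, tensors carry no data yet
        /*.mem_buffer =*/ NULL,
        /*.no_alloc   =*/ true,         // required by the GGML_ASSERT above
    };
    struct ggml_context * ctx = ggml_init(params);

    // create some tensors; with no_alloc their data pointers stay NULL
    ggml_new_tensor_2d(ctx, GGML_TYPE_F32, 4096, 4096);
    ggml_new_tensor_1d(ctx, GGML_TYPE_F32, 4096);

    // one call allocates every unallocated tensor in the context; internally the
    // tensors are packed into as many buffers as the buffer type's max_size requires,
    // and more than one buffer comes back wrapped as a multi buffer
    ggml_backend_buffer_t buf =
        ggml_backend_alloc_ctx_tensors_from_buft(ctx, ggml_backend_cpu_buffer_type());

    *out_ctx = ctx; // caller frees with ggml_backend_buffer_free(buf) and ggml_free(ctx)
    return buf;     // NULL if everything was already allocated or on failure
}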

ggml-backend-impl.h

+6
@@ -19,6 +19,7 @@ extern "C" {
         const char * (*GGML_CALL get_name) (ggml_backend_buffer_type_t buft);
         ggml_backend_buffer_t (*GGML_CALL alloc_buffer) (ggml_backend_buffer_type_t buft, size_t size);
         size_t (*GGML_CALL get_alignment) (ggml_backend_buffer_type_t buft); // tensor alignment
+        size_t (*GGML_CALL get_max_size) (ggml_backend_buffer_type_t buft); // allocation max size
         size_t (*GGML_CALL get_alloc_size) (ggml_backend_buffer_type_t buft, const struct ggml_tensor * tensor); // data size needed to allocate the tensor, including padding
         bool (*GGML_CALL supports_backend)(ggml_backend_buffer_type_t buft, ggml_backend_t backend); // check if the buffer type is usable by the backend
         // check if tensor data is in host memory
@@ -63,6 +64,11 @@ extern "C" {
     // do not use directly, use ggml_backend_tensor_copy instead
     bool ggml_backend_buffer_copy_tensor(const struct ggml_tensor * src, struct ggml_tensor * dst);
 
+    // buffer that contains a collection of buffers
+    GGML_CALL ggml_backend_buffer_t ggml_backend_multi_buffer_alloc_buffer(ggml_backend_buffer_t * buffers, size_t n_buffers);
+    GGML_CALL bool ggml_backend_buffer_is_multi_buffer(ggml_backend_buffer_t buffer);
+    GGML_CALL void ggml_backend_multi_buffer_set_usage(ggml_backend_buffer_t buffer, enum ggml_backend_buffer_usage usage);
+
     //
     // Backend
     //
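A hedged sketch of what a backend might return from the new get_max_size slot (example_buft_get_max_size and the numeric limits are placeholders, not this commit's code; per the commit message, the Vulkan backend reports the minimum of the device's maxMemoryAllocationSize and maxBufferSize):

// Hypothetical get_max_size callback for the interface slot above; limits are placeholders.
static size_t example_buft_get_max_size(ggml_backend_buffer_type_t buft) {
    (void) buft; // a real backend would read the limits from its device context
    const size_t max_memory_allocation_size = 4ull << 30; // placeholder: 4 GiB
    const size_t max_buffer_size            = 2ull << 30; // placeholder: 2 GiB
    return max_memory_allocation_size < max_buffer_size
         ? max_memory_allocation_size : max_buffer_size;
}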
