
Commit 4f665cd

Squashed commit of the following:
commit b617f28  Merge: 73cc5b8 92f44ff  Author: Concedo <[email protected]>  Date: Fri Jun 9 16:10:35 2023 +0800
    Merge branch 'master' into concedo_experimental

commit 73cc5b8  Author: Concedo <[email protected]>  Date: Fri Jun 9 16:09:23 2023 +0800
    added warning message for unsupported K quants

commit 92f44ff  Author: AT <[email protected]>  Date: Fri Jun 9 04:00:51 2023 -0400
    metal : add GELU implementation (ggml-org#1770)
    Co-authored-by: Adam Treat <[email protected]>

commit 245fc3c  Author: Kawrakow <[email protected]>  Date: Fri Jun 9 10:39:59 2023 +0300
    metal : faster q4_0 (ggml-org#1775)
    * metal : 8% faster q4_0
      Avoid copying into local uchar4 and float4.
    * metal : 17% faster Q4_0
      Use 64 threads in a thread group.
    ---------
    Co-authored-by: Iwan Kawrakow <[email protected]>

commit 01dc509  Merge: 0833845 72ff528  Author: Concedo <[email protected]>  Date: Fri Jun 9 14:53:35 2023 +0800
    Merge branch 'master' into concedo_experimental

commit 0833845  Author: Concedo <[email protected]>  Date: Fri Jun 9 14:38:31 2023 +0800
    merged metal patch directly into the file

commit 72ff528  Author: Kawrakow <[email protected]>  Date: Thu Jun 8 22:28:21 2023 +0300
    metal : add Q2_K implementation (ggml-org#1762)
    * metal : add Q2_K implementation
      27.1 ms / token on M2 Max 30-core GPU, so about the same speed as Q4_0.
      Memory throughput is ~156 GB/s. The access pattern used in the Q2_K CUDA
      implementation resulted in significantly lower performance (~31 ms/token).
    * Fixing merge conflicts
    ---------
    Co-authored-by: Iwan Kawrakow <[email protected]>

commit 0bf7cf1  Author: Georgi Gerganov <[email protected]>  Date: Thu Jun 8 20:48:14 2023 +0300
    Revert "ggml : load data into int8x16x4_t using vld4q_s8 on arm64 (ggml-org#1738)"
    This reverts commit 8432d4d.

commit 8432d4d  Author: le.chang <[email protected]>  Date: Fri Jun 9 00:47:56 2023 +0800
    ggml : load data into int8x16x4_t using vld4q_s8 on arm64 (ggml-org#1738)

commit 6fa1613  Author: Hyun-joo KIM <[email protected]>  Date: Fri Jun 9 01:47:36 2023 +0900
    Metal inference enhancement - put hard-wired relative path of ggml-model.model file
    using a patch file due to lack of NSBundle environment

commit 0f291e1  Author: Kawrakow <[email protected]>  Date: Thu Jun 8 19:46:22 2023 +0300
    metal : Q6_K implementation (ggml-org#1752)
    * Metal implementation for Q4_K
      Very slow for now: 42 ms / token, Q4_0 runs in 28 ms/token on my 30-core M2 Max GPU.
    * Optimizing Q4_K on metal
      The first token always takes longer, I guess because the metal kernel is being
      jit-compiled. So, using n = 128 to measure time. At this point Q4_K takes
      29.5 ms / token compared to 27.2 ms / token for Q4_0. Quite a bit better than
      the initial attempt, but still not good enough.
    * Optimizing q4_K metal dot some more
      For n = 256 it is now 28.1 ms/token compared to 27 ms/token for q4_0.
    * Fix after merge with master
    * Metal implementation for Q6_K
      Similar to the CUDA implementation. No idea if this is the optimum for Metal,
      but the few alternative variants I tried all had a lower performance. We get
      36.5 ms / token on M2 Max with 30 GPU cores. This corresponds to ~200 GB/second
      throughput.
    * clang-tidy : add config back
    * Much better Q6_K implementation for metal
      28.3 ms / token for 7B. Subtracting ~9 ms that is spent in other compute graph
      operations, we are left with ~19 ms for the matrix multiplications. The model
      is ~5.5 GB, so we are getting 1000 / 19 * 5.5 = 290 GB/s!
    ---------
    Co-authored-by: Iwan Kawrakow <[email protected]>

commit 7f18160  Author: Hyun-joo KIM <[email protected]>  Date: Fri Jun 9 01:24:22 2023 +0900
    Metal inference enhancement - put hard-wired relative path of ggml-model.model file
    due to lack of NSBundle environment

commit 8fc8179  Author: qingfengfenga <[email protected]>  Date: Thu Jun 8 15:58:53 2023 +0800
    Add llama.cpp docker support for non-latin languages (ggml-org#1673)
    * Modify Dockerfile default character set to improve compatibility (ggml-org#1673)

commit b50b570  Author: Steven Roussey <[email protected]>  Date: Thu Jun 8 00:12:28 2023 -0700
    ggml : fix fprintf warnings (ggml-org#1720)

commit 53aba3f  Author: Georgi Gerganov <[email protected]>  Date: Thu Jun 8 10:09:08 2023 +0300
    clang-tidy : restore dot file from accidental deletion

commit 4161bdc  Author: Kawrakow <[email protected]>  Date: Thu Jun 8 10:08:23 2023 +0300
    metal : add Q4_K implementation (ggml-org#1733)
    * Metal implementation for Q4_K
      Very slow for now: 42 ms / token, Q4_0 runs in 28 ms/token on my 30-core M2 Max GPU.
    * Optimizing Q4_K on metal
      The first token always takes longer, I guess because the metal kernel is being
      jit-compiled. So, using n = 128 to measure time. At this point Q4_K takes
      29.5 ms / token compared to 27.2 ms / token for Q4_0. Quite a bit better than
      the initial attempt, but still not good enough.
    * Optimizing q4_K metal dot some more
      For n = 256 it is now 28.1 ms/token compared to 27 ms/token for q4_0.
    * Fix after merge with master
    ---------
    Co-authored-by: Iwan Kawrakow <[email protected]>

commit 0035858  Author: johnson442 <[email protected]>  Date: Thu Jun 8 08:02:48 2023 +0100
    k-quants : add missing compile definition to CMakeLists (ggml-org#1748)
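The Metal work summarized above lands in two places: the shader kernels themselves (presumably in ggml-metal.metal, which is not part of the excerpt below) and the host-side plumbing in ggml-metal.m shown in the diff. As a point of reference for commit 92f44ff ("metal : add GELU implementation"), a minimal element-wise GELU kernel in Metal using the common tanh approximation looks roughly like the sketch below; the kernel name and constants are illustrative, not copied from the merged shader.

    #include <metal_stdlib>
    using namespace metal;

    // Sketch only: an element-wise GELU kernel using the tanh approximation,
    // illustrating the kind of kernel registered as "gelu" on the host side.
    // Name and constants are illustrative, not taken from ggml-metal.metal.
    constant float GELU_COEF_A    = 0.044715f;
    constant float SQRT_2_OVER_PI = 0.79788456080286535587989211986876f;

    kernel void kernel_gelu_sketch(
            device const float * src0,
            device       float * dst,
            uint tpig[[thread_position_in_grid]]) {
        const float x = src0[tpig];
        // 0.5 * x * (1 + tanh(sqrt(2/pi) * (x + 0.044715 * x^3)))
        dst[tpig] = 0.5f * x * (1.0f + tanh(SQRT_2_OVER_PI * x * (1.0f + GELU_COEF_A * x * x)));
    }

On the host side, a kernel like this is picked up by the GGML_METAL_DECL_KERNEL(gelu) / GGML_METAL_ADD_KERNEL(gelu) pair and dispatched with one thread per element in the new GGML_OP_GELU case, as the ggml-metal.m diff further down shows.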
1 parent: dee692a · commit: 4f665cd

File tree

4 files changed: +620, -36 lines


ggml-metal.m (+74, -3)
@@ -45,13 +45,20 @@
     GGML_METAL_DECL_KERNEL(scale);
     GGML_METAL_DECL_KERNEL(silu);
     GGML_METAL_DECL_KERNEL(relu);
+    GGML_METAL_DECL_KERNEL(gelu);
     GGML_METAL_DECL_KERNEL(soft_max);
     GGML_METAL_DECL_KERNEL(diag_mask_inf);
     GGML_METAL_DECL_KERNEL(get_rows_f16);
     GGML_METAL_DECL_KERNEL(get_rows_q4_0);
+    GGML_METAL_DECL_KERNEL(get_rows_q2_k);
+    GGML_METAL_DECL_KERNEL(get_rows_q4_k);
+    GGML_METAL_DECL_KERNEL(get_rows_q6_k);
     GGML_METAL_DECL_KERNEL(rms_norm);
     GGML_METAL_DECL_KERNEL(mul_mat_f16_f32);
     GGML_METAL_DECL_KERNEL(mul_mat_q4_0_f32);
+    GGML_METAL_DECL_KERNEL(mul_mat_q2_k_f32);
+    GGML_METAL_DECL_KERNEL(mul_mat_q4_k_f32);
+    GGML_METAL_DECL_KERNEL(mul_mat_q6_k_f32);
     GGML_METAL_DECL_KERNEL(rope);
     GGML_METAL_DECL_KERNEL(cpy_f32_f16);
     GGML_METAL_DECL_KERNEL(cpy_f32_f32);
@@ -99,7 +106,7 @@
     NSError * error = nil;
 
     //NSString * path = [[NSBundle mainBundle] pathForResource:@"../../examples/metal/metal" ofType:@"metal"];
-    NSString * path = [[NSBundle mainBundle] pathForResource:@"ggml-metal" ofType:@"metal"];
+    NSString * path = @"./ggml-metal.metal";
     fprintf(stderr, "%s: loading '%s'\n", __func__, [path UTF8String]);
 
     NSString * src = [NSString stringWithContentsOfFile:path encoding:NSUTF8StringEncoding error:&error];
@@ -129,13 +136,20 @@
         GGML_METAL_ADD_KERNEL(scale);
         GGML_METAL_ADD_KERNEL(silu);
         GGML_METAL_ADD_KERNEL(relu);
+        GGML_METAL_ADD_KERNEL(gelu);
         GGML_METAL_ADD_KERNEL(soft_max);
         GGML_METAL_ADD_KERNEL(diag_mask_inf);
         GGML_METAL_ADD_KERNEL(get_rows_f16);
         GGML_METAL_ADD_KERNEL(get_rows_q4_0);
+        GGML_METAL_ADD_KERNEL(get_rows_q2_k);
+        GGML_METAL_ADD_KERNEL(get_rows_q4_k);
+        GGML_METAL_ADD_KERNEL(get_rows_q6_k);
         GGML_METAL_ADD_KERNEL(rms_norm);
         GGML_METAL_ADD_KERNEL(mul_mat_f16_f32);
         GGML_METAL_ADD_KERNEL(mul_mat_q4_0_f32);
+        GGML_METAL_ADD_KERNEL(mul_mat_q2_k_f32);
+        GGML_METAL_ADD_KERNEL(mul_mat_q4_k_f32);
+        GGML_METAL_ADD_KERNEL(mul_mat_q6_k_f32);
         GGML_METAL_ADD_KERNEL(rope);
         GGML_METAL_ADD_KERNEL(cpy_f32_f16);
         GGML_METAL_ADD_KERNEL(cpy_f32_f32);
@@ -408,6 +422,20 @@ void ggml_metal_graph_compute(
 
                         [encoder dispatchThreadgroups:MTLSizeMake(n, 1, 1) threadsPerThreadgroup:MTLSizeMake(1, 1, 1)];
                     } break;
+                case GGML_OP_GELU:
+                    {
+                        if (encoder == nil) {
+                            encoder = [command_buffer computeCommandEncoder];
+                        }
+
+                        [encoder setComputePipelineState:ctx->pipeline_gelu];
+                        [encoder setBuffer:id_src0 offset:offs_src0 atIndex:0];
+                        [encoder setBuffer:id_dst offset:offs_dst atIndex:1];
+
+                        const int64_t n = ggml_nelements(dst);
+
+                        [encoder dispatchThreadgroups:MTLSizeMake(n, 1, 1) threadsPerThreadgroup:MTLSizeMake(1, 1, 1)];
+                    } break;
                 case GGML_OP_SOFT_MAX:
                     {
                         if (encoder == nil) {
@@ -514,10 +542,41 @@ void ggml_metal_graph_compute(
                                 GGML_ASSERT(ne12 == 1);
 
                                 nth0 = 8;
-                                nth1 = 4;
+                                nth1 = 8;
                                 [encoder setComputePipelineState:ctx->pipeline_mul_mat_q4_0_f32];
                             } break;
-                        default: GGML_ASSERT(false && "not implemented");
+                        case GGML_TYPE_Q2_K:
+                            {
+                                GGML_ASSERT(ne02 == 1);
+                                GGML_ASSERT(ne12 == 1);
+
+                                nth0 = 4;
+                                nth1 = 16;
+                                [encoder setComputePipelineState:ctx->pipeline_mul_mat_q2_k_f32];
+                            } break;
+                        case GGML_TYPE_Q4_K:
+                            {
+                                GGML_ASSERT(ne02 == 1);
+                                GGML_ASSERT(ne12 == 1);
+
+                                nth0 = 4;
+                                nth1 = 16;
+                                [encoder setComputePipelineState:ctx->pipeline_mul_mat_q4_k_f32];
+                            } break;
+                        case GGML_TYPE_Q6_K:
+                            {
+                                GGML_ASSERT(ne02 == 1);
+                                GGML_ASSERT(ne12 == 1);
+
+                                nth0 = 4;
+                                nth1 = 16;
+                                [encoder setComputePipelineState:ctx->pipeline_mul_mat_q6_k_f32];
+                            } break;
+                        default:
+                            {
+                                fprintf(stderr, "Asserting on type %d\n",(int)src0t);
+                                GGML_ASSERT(false && "not implemented");
+                            }
                     };
 
 
@@ -540,6 +599,15 @@ void ggml_metal_graph_compute(
                     if (src0t == GGML_TYPE_Q4_0) {
                         [encoder setThreadgroupMemoryLength:nth0*nth1*sizeof(float) atIndex:0];
                         [encoder dispatchThreadgroups:MTLSizeMake(ne01, ne11, 1) threadsPerThreadgroup:MTLSizeMake(nth0, nth1, 1)];
+                    } else if (src0t == GGML_TYPE_Q2_K) {
+                        [encoder setThreadgroupMemoryLength:nth0*nth1*sizeof(float) atIndex:0];
+                        [encoder dispatchThreadgroups:MTLSizeMake(ne01, 1, 1) threadsPerThreadgroup:MTLSizeMake(nth0, nth1, 1)];
+                    } else if (src0t == GGML_TYPE_Q4_K) {
+                        [encoder setThreadgroupMemoryLength:nth0*nth1*sizeof(float) atIndex:0];
+                        [encoder dispatchThreadgroups:MTLSizeMake(ne01, ne11, 1) threadsPerThreadgroup:MTLSizeMake(nth0, nth1, 1)];
+                    } else if (src0t == GGML_TYPE_Q6_K) {
+                        [encoder setThreadgroupMemoryLength:nth0*nth1*sizeof(float) atIndex:0];
+                        [encoder dispatchThreadgroups:MTLSizeMake(ne01, ne11, 1) threadsPerThreadgroup:MTLSizeMake(nth0, nth1, 1)];
                     } else {
                         [encoder setThreadgroupMemoryLength:nth0*sizeof(float) atIndex:0];
                         [encoder dispatchThreadgroups:MTLSizeMake(ne01, ne11, ne12) threadsPerThreadgroup:MTLSizeMake(nth0, nth1, 1)];
@@ -555,6 +623,9 @@ void ggml_metal_graph_compute(
                 switch (src0->type) {
                     case GGML_TYPE_F16: [encoder setComputePipelineState:ctx->pipeline_get_rows_f16]; break;
                     case GGML_TYPE_Q4_0: [encoder setComputePipelineState:ctx->pipeline_get_rows_q4_0]; break;
+                    case GGML_TYPE_Q2_K: [encoder setComputePipelineState:ctx->pipeline_get_rows_q2_k]; break;
+                    case GGML_TYPE_Q4_K: [encoder setComputePipelineState:ctx->pipeline_get_rows_q4_k]; break;
+                    case GGML_TYPE_Q6_K: [encoder setComputePipelineState:ctx->pipeline_get_rows_q6_k]; break;
                     default: GGML_ASSERT(false && "not implemented");
                 }
