commit b617f28
Merge: 73cc5b8 92f44ff
Author: Concedo <[email protected]>
Date: Fri Jun 9 16:10:35 2023 +0800
Merge branch 'master' into concedo_experimental
commit 73cc5b8
Author: Concedo <[email protected]>
Date: Fri Jun 9 16:09:23 2023 +0800
added warning message for unsupported K quants
commit 92f44ff
Author: AT <[email protected]>
Date: Fri Jun 9 04:00:51 2023 -0400
metal : add GELU implementation (ggml-org#1770)
Co-authored-by: Adam Treat <[email protected]>
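The commit above adds a GELU kernel to the Metal backend. As a rough sketch of what such a kernel computes, here is the standard tanh approximation of GELU in plain C++; the constants are the usual approximation values, and the actual Metal Shading Language kernel added in ggml-org#1770 may differ in detail.

```cpp
#include <cmath>

// Tanh approximation of GELU, as commonly used in ggml-style backends.
// Plain C++ sketch, not the Metal kernel itself.
static float gelu_approx(float x) {
    const float sqrt_2_over_pi = 0.7978845608028654f; // sqrt(2/pi)
    const float coeff          = 0.044715f;
    return 0.5f * x * (1.0f + std::tanh(sqrt_2_over_pi * (x + coeff * x * x * x)));
}
```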
commit 245fc3c
Author: Kawrakow <[email protected]>
Date: Fri Jun 9 10:39:59 2023 +0300
metal : faster q4_0 (ggml-org#1775)
* metal : 8% faster q4_0
Avoid copying into local uchar4 and float4.
* metal : 17% faster Q4_0
Use 64 threads in a thread group.
---------
Co-authored-by: Iwan Kawrakow <[email protected]>
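For context on the Q4_0 kernel being tuned above, below is a hedged sketch of the Q4_0 block layout and a scalar dequantization loop in C++. The field names and the (q - 8) * d mapping follow my reading of ggml's 4-bit format; the real Metal kernel works on these blocks with vector loads and, after this commit, 64 threads per threadgroup.

```cpp
#include <cstdint>

// Sketch of a ggml-style Q4_0 block: 32 weights stored as 4-bit values plus one scale.
// Layout and nibble ordering are my reading of the format, not code from this commit.
struct block_q4_0 {
    uint16_t d;       // fp16 scale (shown here as raw bits)
    uint8_t  qs[16];  // 32 x 4-bit quants, two per byte
};

// Scalar reference dequantization: weight = (q - 8) * d.
// The scale is passed in already converted to float for simplicity.
void dequantize_q4_0(const block_q4_0 & b, float d, float * out /* 32 floats */) {
    for (int i = 0; i < 16; ++i) {
        const int q0 = (b.qs[i] & 0x0F) - 8; // low nibble  -> element i
        const int q1 = (b.qs[i] >>   4) - 8; // high nibble -> element i + 16
        out[i]      = q0 * d;
        out[i + 16] = q1 * d;
    }
}
```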
commit 01dc509
Merge: 0833845 72ff528
Author: Concedo <[email protected]>
Date: Fri Jun 9 14:53:35 2023 +0800
Merge branch 'master' into concedo_experimental
commit 0833845
Author: Concedo <[email protected]>
Date: Fri Jun 9 14:38:31 2023 +0800
merged metal patch directly into the file
commit 72ff528
Author: Kawrakow <[email protected]>
Date: Thu Jun 8 22:28:21 2023 +0300
metal : add Q2_K implementation (ggml-org#1762)
* metal : add Q2_K implementation
27.1 ms / token on M2 Max 30-core GPU, so about the
same speed as Q4_0. Memory throughput is ~156 GB/s.
The access pattern used in the Q2_K
CUDA implementation resulted in significantly lower
performance (~31 ms/token).
* Fixing merge conflicts
---------
Co-authored-by: Iwan Kawrakow <[email protected]>
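The Q2_K commit above stores weights at 2 bits each inside super-blocks, with per-sub-block scales and mins. As a hedged illustration of the arithmetic involved (not the actual kernel and not the exact ggml struct layout), here is how four 2-bit quants can be unpacked from one byte and dequantized in C++:

```cpp
#include <cstdint>

// Hedged sketch: unpack four 2-bit quants from one byte and dequantize them as
// value = d * scale * q - dmin * min, the general shape of 2-bit k-quant
// dequantization. The exact layout and ordering in ggml's block_q2_K may differ.
void dequant_2bit_byte(uint8_t byte, float d, float dmin,
                       float scale, float min, float out[4]) {
    for (int i = 0; i < 4; ++i) {
        const int q = (byte >> (2 * i)) & 0x3; // extract one 2-bit quant
        out[i] = d * scale * q - dmin * min;
    }
}
```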
commit 0bf7cf1
Author: Georgi Gerganov <[email protected]>
Date: Thu Jun 8 20:48:14 2023 +0300
Revert "ggml : load data into int8x16x4_t using vld4q_s8 on arm64 (ggml-org#1738)"
This reverts commit 8432d4d.
commit 8432d4d
Author: le.chang <[email protected]>
Date: Fri Jun 9 00:47:56 2023 +0800
ggml : load data into int8x16x4_t using vld4q_s8 on arm64 (ggml-org#1738)
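For reference on what this (later reverted) change does, below is a small C++/NEON sketch contrasting a vld4q_s8 load with four separate vld1q_s8 loads; it is not the actual ggml code. Note that vld4q_s8 reads 64 contiguous bytes but de-interleaves them across the four result vectors, so whether it helps depends on how the data is consumed, which may be why the change was reverted.

```cpp
#include <arm_neon.h>  // requires an arm64 target

// Hedged sketch of the two load patterns; not the actual ggml code.
void load_64_bytes(const int8_t * p) {
    // Single structured load: reads 64 contiguous bytes, but de-interleaves
    // them, so v.val[j] holds bytes p[j], p[j+4], p[j+8], ...
    int8x16x4_t v = vld4q_s8(p);

    // Four plain loads: each vector holds 16 contiguous bytes.
    int8x16_t a = vld1q_s8(p);
    int8x16_t b = vld1q_s8(p + 16);
    int8x16_t c = vld1q_s8(p + 32);
    int8x16_t d = vld1q_s8(p + 48);

    (void)v; (void)a; (void)b; (void)c; (void)d; // silence unused warnings
}
```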
commit 6fa1613
Author: Hyun-joo KIM <[email protected]>
Date: Fri Jun 9 01:47:36 2023 +0900
Metal inference enhancement - hard-wire the relative path of the ggml-model.model file via a patch file, due to the lack of an NSBundle environment
commit 0f291e1
Author: Kawrakow <[email protected]>
Date: Thu Jun 8 19:46:22 2023 +0300
metal : Q6_K implementation (ggml-org#1752)
* Metal implementation for Q4_K
Very slow for now:
42 ms / token, Q4_0 runs in 28 ms/token on my
30-core M2 Max GPU.
* Optimizing Q4_K on metal
The first token always takes longer, I guess because
the metal kernel is being jit-compiled.
So, using n = 128 to measure time.
At this point Q4_K takes 29.5 ms / token
compared to 27.2 ms / token for Q4_0.
Quite a bit better than the initial attempt,
but still not good enough.
* Optimizing q4_K metal dot some more
For n = 256 it is now 28.1 ms/token compared to
27 ms/token for q4_0.
* Fix after merge with master
* Metal implementation for Q6_K
Similar to the CUDA implementation.
No idea if this is the optimum for Metal, but the few
alternative variants I tried all had lower performance.
We get 36.5 ms / token on M2 Max with 30 GPU cores.
This corresponds to ~200 GB/second throughput.
* clang-tidy : add config back
* Much better Q6_K implementation for metal
28.3 ms / token for 7B. Subtracting ~9 ms that is spent in
other compute graph operations, we are left with ~19 ms
for the matrix multiplications. The model is ~5.5 GB,
so we are getting 1000 / 19 * 5.5 = 290 GB/s!
---------
Co-authored-by: Iwan Kawrakow <[email protected]>
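The ~290 GB/s figure above is straightforward arithmetic: model bytes read per token divided by the time attributed to the matrix multiplications. A minimal C++ sketch of that calculation, using the numbers quoted in the commit (my framing, not code from the repository):

```cpp
#include <cstdio>

int main() {
    const double model_gb     = 5.5;   // ~5.5 GB Q6_K 7B model, per the commit
    const double ms_per_token = 28.3;  // measured time per token
    const double other_ms     = 9.0;   // time spent outside the matmuls
    const double matmul_ms    = ms_per_token - other_ms; // ~19 ms

    // GB/s = GB read per token / seconds spent reading it
    const double gb_per_s = model_gb * 1000.0 / matmul_ms;
    std::printf("effective throughput: %.0f GB/s\n", gb_per_s); // ~285 GB/s; the commit rounds to ~290
    return 0;
}
```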
commit 7f18160
Author: Hyun-joo KIM <[email protected]>
Date: Fri Jun 9 01:24:22 2023 +0900
Metal inference enhancement - hard-wire the relative path of the ggml-model.model file, due to the lack of an NSBundle environment
commit 8fc8179
Author: qingfengfenga <[email protected]>
Date: Thu Jun 8 15:58:53 2023 +0800
Add llama.cpp docker support for non-latin languages (ggml-org#1673)
* Modify Dockerfile default character set to improve compatibility (ggml-org#1673)
commit b50b570
Author: Steven Roussey <[email protected]>
Date: Thu Jun 8 00:12:28 2023 -0700
ggml : fix fprintf warnings (ggml-org#1720)
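Commit b50b570 fixes fprintf format warnings. As a generic illustration of the kind of fix involved (the specific call sites in ggml-org#1720 are not reproduced here), mismatched format specifiers are replaced with ones matching the argument types:

```cpp
#include <cinttypes>
#include <cstdio>

int main() {
    size_t  n   = 42;
    int64_t pos = -7;

    // Typical -Wformat offenders: "%d" used with size_t or int64_t arguments.
    // The portable fixes use %zu for size_t and PRId64 for int64_t.
    std::fprintf(stderr, "n = %zu, pos = %" PRId64 "\n", n, pos);
    return 0;
}
```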
commit 53aba3f
Author: Georgi Gerganov <[email protected]>
Date: Thu Jun 8 10:09:08 2023 +0300
clang-tidy : restore dot file from accidental deletion
commit 4161bdc
Author: Kawrakow <[email protected]>
Date: Thu Jun 8 10:08:23 2023 +0300
metal : add Q4_K implementation (ggml-org#1733)
* Metal implementation for Q4_K
Very slow for now:
42 ms / token, Q4_0 runs in 28 ms/token on my
30-core M2 Max GPU.
* Optimizing Q4_K on metal
The first token always takes longer, I guess because
the metal kernel is being jit-compiled.
So, using n = 128 to measure time.
At this point Q4_K takes 29.5 ms / token
compared to 27.2 ms / token for Q4_0.
Quite a bit better than the initial attempt,
but still not good enough.
* Optimizing q4_K metal dot some more
For n = 256 it is now 28.1 ms/token compared to
27 ms/token for q4_0.
* Fix after merge with master
---------
Co-authored-by: Iwan Kawrakow <[email protected]>
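The Q4_K and Q6_K commits both note that the first token is slow because the Metal kernel is compiled on first use, so timings are taken over n = 128 or n = 256 tokens. A small hedged C++ sketch of that measurement approach; generate_token() is a hypothetical stand-in, not a real llama.cpp API:

```cpp
#include <chrono>

// Hypothetical stand-in for one decode step; replace with a real decode call.
static void generate_token() { /* placeholder */ }

double average_ms_per_token(int n) {
    generate_token(); // warm-up: the first token also pays for kernel compilation

    const auto start = std::chrono::steady_clock::now();
    for (int i = 0; i < n; ++i) {
        generate_token();
    }
    const auto end = std::chrono::steady_clock::now();

    const double total_ms =
        std::chrono::duration<double, std::milli>(end - start).count();
    return total_ms / n; // e.g. n = 128, as in the commit message
}
```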
commit 0035858
Author: johnson442 <[email protected]>
Date: Thu Jun 8 08:02:48 2023 +0100
k-quants : add missing compile definition to CMakeLists (ggml-org#1748)