I'm trying to replace the mixed-precision quantized-GEMM CUDA kernel in llama.cpp with my own implementation. To do that, I need to understand the data layout and the computation logic of the `mul_mat_vec_q` kernel.

I tried to understand the code just by reading it, but failed: there are a large number of unfamiliar variables, and the parallel computation is complex.

How should I go about understanding this code? What do `qk`, `qi`, and `vdr` mean, and how does the kernel work?
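For context, here is what I have pieced together so far about those three constants (please correct me if this is wrong). They come from the per-type traits in ggml-cuda: `qk` is the number of quantized values in one block of the type, `qi` is the number of 32-bit integers that one block's quant data occupies, and `vdr` ("vec dot ratio") is how many of those integers a single thread consumes per `vec_dot_q_cuda` call. For `GGML_TYPE_Q4_0`, I believe the values are:

```cpp
// My reading of the Q4_0 traits -- illustrative only, values differ per type.
constexpr int qk  = 32; // QK4_0: one block packs 32 weights (4 bits each) plus an fp16 scale
constexpr int qi  = 4;  // QI4_0: 32 x 4-bit quants = 16 bytes = 4 x 32-bit integers
constexpr int vdr = 2;  // VDR_Q4_0_Q8_1_MMVQ: each thread handles 2 of those integers
                        // per vec_dot_q_cuda call, so qi/vdr = 2 threads cover one block
```

Here is the kernel I'm trying to understand: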
```cpp
template <ggml_type type, int ncols_y>
// tell the compiler to use as many registers as it wants, see nwarps definition below
__launch_bounds__(calc_nwarps(ncols_y, get_device_table_id())*ggml_cuda_get_physical_warp_size(), 1)
static __global__ void mul_mat_vec_q(
    const void * __restrict__ vx, const void * __restrict__ vy, float * __restrict__ dst,
    const int ncols_x, const int nrows_x, const int nrows_y, const int nrows_dst) {
    constexpr int qk  = ggml_cuda_type_traits<type>::qk;
    constexpr int qi  = ggml_cuda_type_traits<type>::qi;
    constexpr int vdr = get_vdr_mmvq(type);

    constexpr mmvq_parameter_table_id table_id = get_device_table_id();
    constexpr int nwarps              = calc_nwarps(ncols_y, table_id);
    constexpr int rows_per_cuda_block = calc_rows_per_block(ncols_y, table_id);
    constexpr int warp_size           = ggml_cuda_get_physical_warp_size();

    constexpr vec_dot_q_cuda_t vec_dot_q_cuda = get_vec_dot_q_cuda(type);
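    // [my annotation] table_id selects per-architecture tuning tables; nwarps and
    // rows_per_cuda_block are read from them, so each CUDA block computes
    // rows_per_cuda_block consecutive rows of dst for all ncols_y columns of y.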
    const int tid  = warp_size*threadIdx.y + threadIdx.x;
    const int row0 = rows_per_cuda_block*blockIdx.x;

    const int blocks_per_row_x = ncols_x / qk;
    const int blocks_per_col_y = nrows_y / QK8_1;
    constexpr int blocks_per_iter = vdr * nwarps * warp_size / qi;
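    // [my annotation] Each vec_dot call consumes vdr of the qi integers in an x
    // block, so one block is shared by qi/vdr consecutive threads, and the whole
    // thread block advances by blocks_per_iter = (total threads)/(qi/vdr) blocks
    // per loop iteration. E.g. for Q4_0 with nwarps = 1 and warp_size = 32:
    // blocks_per_iter = 2*1*32/4 = 16.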
    // partial sum for each thread
    float tmp[ncols_y][rows_per_cuda_block] = {{0.0f}};

    const block_q8_1 * y = (const block_q8_1 *) vy;
    for (int kbx = tid / (qi/vdr); kbx < blocks_per_row_x; kbx += blocks_per_iter) {
        const int kby = kbx * (qk/QK8_1); // y block index that aligns with kbx

        // x block quant index when casting the quants to int
        const int kqs = vdr * (tid % (qi/vdr));
#pragma unroll
        for (int j = 0; j < ncols_y; ++j) {
#pragma unroll
            for (int i = 0; i < rows_per_cuda_block; ++i) {
                tmp[j][i] += vec_dot_q_cuda(vx, &y[j*blocks_per_col_y + kby], (row0 + i)*blocks_per_row_x + kbx, kqs);
            }
        }
    }
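    // [my annotation] Reduction stage: warps 1..nwarps-1 stash their partial sums
    // in shared memory, then warp 0 accumulates them and does the final reduction.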
    __shared__ float tmp_shared[nwarps-1 > 0 ? nwarps-1 : 1][ncols_y][rows_per_cuda_block][warp_size];
    if (threadIdx.y > 0) {
#pragma unroll
        for (int j = 0; j < ncols_y; ++j) {
#pragma unroll
            for (int i = 0; i < rows_per_cuda_block; ++i) {
                tmp_shared[threadIdx.y-1][j][i][threadIdx.x] = tmp[j][i];
            }
        }
    }
    __syncthreads();
    if (threadIdx.y > 0) {
        return;
    }
    // sum up partial sums and write back result
#pragma unroll
    for (int j = 0; j < ncols_y; ++j) {
#pragma unroll
        for (int i = 0; i < rows_per_cuda_block; ++i) {
#pragma unroll
            for (int l = 0; l < nwarps-1; ++l) {
                tmp[j][i] += tmp_shared[l][j][i][threadIdx.x];
            }
            tmp[j][i] = warp_reduce_sum<warp_size>(tmp[j][i]);
        }

        if (threadIdx.x < rows_per_cuda_block && (rows_per_cuda_block == 1 || row0 + threadIdx.x < (unsigned)nrows_dst)) {
            dst[j*nrows_dst + row0 + threadIdx.x] = tmp[j][threadIdx.x];
        }
    }
    GGML_UNUSED(nrows_x);
}
```
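If it helps to check my understanding: stripped of all the quantization and parallelization machinery, I believe the kernel computes an ordinary matrix-vector (or thin matrix-matrix) product. Here is a scalar sketch of the result, where `dequantize_x` and `dequantize_y` are hypothetical stand-ins for the type-specific unpacking that `vec_dot_q_cuda` actually fuses into integer dot products:

```cpp
// Scalar reference for what mul_mat_vec_q should produce (my reading, unverified).
// x: quantized, nrows_x rows of ncols_x values; y: q8_1, ncols_y columns of nrows_y values.
for (int j = 0; j < ncols_y; ++j) {               // one column of y per j
    for (int row = 0; row < nrows_dst; ++row) {   // one row of x per output element
        float sum = 0.0f;
        for (int k = 0; k < ncols_x; ++k) {       // k runs over the shared dimension
            sum += dequantize_x(vx, row, k) * dequantize_y(vy, j, k);
        }
        dst[j*nrows_dst + row] = sum;             // dst is column-major with stride nrows_dst
    }
}
```

Is this right, or am I missing something about the layout?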
I'm really stuck on this. Can anyone help me?