
Commit 5b5fce9

fmz, max-krasnyansky, and slaren authored and committed
Threadpool: take 2 (ggml-org#8672)
* Introduce ggml_compute_threadpool
  - OpenMP functional: check
  - Vanilla ggml functional: check
  - ggml w/ threadpool functional: check
  - OpenMP no regression: no glaring problems
  - Vanilla ggml no regression: no glaring problems
  - ggml w/ threadpool no regression: no glaring problems
* Minor fixes
* Fixed a use-after-release bug
* Fixed a harmless race condition
* Fix Android build issue
* Fix more race conditions
* Fix deadlock for cases where cgraph.n_nodes == 1, and fix the --poll case
* threadpool: use cpu_get_num_math to set the default number of threadpool threads
  This way we avoid using E-cores and hyperthreaded siblings.
* bench: create a fresh threadpool for each test
  For benchmarking it's better to start a fresh pool for each test, with the exact number of threads needed for that test. Having larger pools is suboptimal (causes more load, etc).
* atomics: always use stdatomics with clang, and use relaxed memory order when polling in ggml_barrier
  This also removes sched_yield() calls from ggml_barrier() to match OpenMP behavior.
* threadpool: make polling the default, to match OpenMP behavior
  All command-line args now allow setting poll to 0 (false).
* threadpool: do not wake up threads in an already-paused threadpool
* Fix a potential race condition in check_for_work
* threadpool: do not create two threadpools if their params are identical
* threadpool: reduce pause/resume/wakeup overhead in common cases
  We now start the threadpool in the paused state only if we have two. The resume is now implicit (i.e. new work), which allows for reduced locking and context-switch overhead.
* threadpool: add support for hybrid polling
  Poll params (--poll, ...) now specify a "polling level", i.e. how aggressively we poll before waiting on the cond. var. poll=0 means no polling, 1 means poll for 128K rounds then wait, 2 for 256K rounds, and so on. The default value of 50 (i.e. 50x128K rounds) seems like a decent default across modern platforms. We can tune this further as things evolve.
* threadpool: reduce the number of barriers required
  New work is now indicated with an atomic counter that is incremented for each new graph that needs to be computed. This removes the need for an extra barrier for clearing "new_work" and removes the special case for trivial graphs.
* threadpool: remove special-casing for disposable threadpools
  With the efficient hybrid polling there is no need to make disposable pools any different. This simplifies the overall logic and reduces branching. Include n_threads in the debug print for disposable threadpools.
  Declare the pause and stop flags as atomic_bool. This doesn't actually generate any memory barriers; it simply informs the thread sanitizer that these flags can be written & read by different threads without locking.
* threadpool: do not clear barrier counters between graph computes (fixes race with small graphs)
  This fixes the race condition with very small graphs where the main thread happens to start a new graph while the workers are just about to exit from the barriers.
* threadpool: use relaxed order for chunk sync
  A full memory barrier is overkill here, since each thread works on a different chunk.
* threadpool: remove abort_callback from threadpool state
* threadpool: better naming for thread/cpumask-related functions
* threadpool: consistent use of int type for n_threads params
* threadpool: add support for ggml_threadpool_params_default/init
  Also removes the need for the explicit mask_specified param: an all-zero cpumask means use the default (usually inherited) CPU affinity mask.
* threadpool: move typedef into ggml.h
* threadpool: fix apply_priority() function name
* threadpool: fix Swift wrapper errors due to the n_threads int type cleanup
* threadpool: enable --cpu-mask and other threadpool-related options only if the threadpool is enabled
* threadpool: replace checks of the compute_thread return code with a proper status check
* threadpool: simplify threadpool init logic and fix main-thread affinity application
  Most of the init code is now exactly the same between threadpool and OpenMP.
* threadpool: update threadpool resume/pause function names
* threadpool: enable OpenMP by default for now
* threadpool: don't forget to free the workers' state when OpenMP is enabled
* threadpool: avoid updating process priority on platforms that do not require it
  On Windows we need to change the overall process priority class in order to set thread priorities, but on Linux, Mac, etc we do not need to touch the overall process settings.
* threadpool: update the calling thread's prio and affinity only at start/resume
  This avoids extra syscalls for each graph_compute().
* llama-bench: turn threadpool params into vectors, add output headers, etc
* llama-bench: add support for cool-off between tests (--delay)
  This helps for long-running tests on platforms that are thermally limited (phones, laptops, etc). --delay (disabled by default) introduces a sleep of N seconds before starting each test.
* threadpool: move process priority setting into the apps (bench and cli)
  This avoids changing the overall process priority on Windows for apps that use ggml/llama.cpp directly.
* threadpool: move all pause/resume logic into ggml
* threadpool: further API cleanup and prep for future refactoring
  All threadpool-related functions and structs use the ggml_threadpool prefix.
* threadpool: minor indent fixes
* threadpool: improve the setpriority error message
* Update examples/llama-bench/llama-bench.cpp
  Co-authored-by: slaren <[email protected]>
* threadpool: fix indent in set_threadpool call
* Use int32_t for the n_thread type in the public llama.cpp API
* threadpool: use _new and _free instead of _create and _release
* Fix two more public APIs to use int32_t for n_threads
* build: set _GNU_SOURCE for Android

---------

Co-authored-by: Max Krasnyansky <[email protected]>
Co-authored-by: fmz <[email protected]>
Co-authored-by: Max Krasnyansky <[email protected]>
Co-authored-by: slaren <[email protected]>
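The hybrid polling and the atomic "new work" counter described above can be pictured with a small sketch. This is illustrative only, not the actual ggml internals; the names (worker_sync, wait_for_work, poll_level) are made up for this example:

// Minimal sketch of hybrid polling, assuming a per-pool atomic graph counter.
// Not the real ggml implementation; names and structure are illustrative.
#include <atomic>
#include <condition_variable>
#include <mutex>

struct worker_sync {
    std::atomic<int>        n_graph{0}; // incremented once per new graph to compute
    std::mutex              mutex;
    std::condition_variable cond;
};

// poll_level = 0 means no polling; otherwise spin for poll_level * 128K rounds
// with relaxed loads before falling back to the condition variable.
static void wait_for_work(worker_sync & s, int last_graph, uint32_t poll_level) {
    const uint64_t rounds = (uint64_t) poll_level * (128 * 1024);
    for (uint64_t i = 0; i < rounds; i++) {
        if (s.n_graph.load(std::memory_order_relaxed) != last_graph) {
            return; // new work arrived while polling; no syscall needed
        }
    }
    // polling budget exhausted: block on the condition variable instead
    std::unique_lock<std::mutex> lock(s.mutex);
    s.cond.wait(lock, [&] { return s.n_graph.load() != last_graph; });
}

Bumping n_graph is all it takes to signal new work, which is why the extra barrier for clearing a "new_work" flag could be removed.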
1 parent 88fa42c commit 5b5fce9

File tree

22 files changed: +1311 −258 lines


common/common.cpp (+327 −23)

Large diffs are not rendered by default.

common/common.h (+23 −7)

@@ -67,13 +67,18 @@ enum dimre_method {
     DIMRE_METHOD_MEAN,
 };
 
+struct cpu_params {
+    int      n_threads = -1;
+    bool     cpumask[GGML_MAX_N_THREADS] = {false}; // CPU affinity mask.
+    bool     mask_valid = false;                    // Default: any CPU
+    enum ggml_sched_priority priority = GGML_SCHED_PRIO_NORMAL; // Scheduling prio : (0 - normal, 1 - medium, 2 - high, 3 - realtime)
+    bool     strict_cpu = false;                    // Use strict CPU placement
+    uint32_t poll = 50;                             // Polling (busywait) level (0 - no polling, 100 - mostly polling)
+};
+
 struct gpt_params {
     uint32_t seed = LLAMA_DEFAULT_SEED; // RNG seed
 
-    int32_t n_threads = cpu_get_num_math();
-    int32_t n_threads_draft = -1;
-    int32_t n_threads_batch = -1; // number of threads to use for batch processing (-1 = use n_threads)
-    int32_t n_threads_batch_draft = -1;
     int32_t n_predict = -1; // new tokens to predict
     int32_t n_ctx = 0; // context size
     int32_t n_batch = 2048; // logical batch size for prompt processing (must be >=32 to use BLAS)
@@ -100,6 +105,11 @@ struct gpt_params {
     int32_t yarn_orig_ctx = 0; // YaRN original context length
     float   defrag_thold = -1.0f; // KV cache defragmentation threshold
 
+    struct cpu_params cpuparams;
+    struct cpu_params cpuparams_batch;
+    struct cpu_params draft_cpuparams;
+    struct cpu_params draft_cpuparams_batch;
+
     ggml_backend_sched_eval_callback cb_eval = nullptr;
     void * cb_eval_user_data = nullptr;
 
@@ -204,7 +214,7 @@ struct gpt_params {
     int32_t port           = 8080; // server listens on this network port
     int32_t timeout_read   = 600;  // http read timeout in seconds
     int32_t timeout_write  = timeout_read; // http write timeout in seconds
-    int32_t n_threads_http = -1;   // number of threads to process HTTP requests
+    int     n_threads_http = -1;   // number of threads to process HTTP requests (TODO: support threadpool)
 
     std::string hostname    = "127.0.0.1";
     std::string public_path = "";
@@ -277,6 +287,11 @@ void gpt_params_print_usage(int argc, char ** argv, const gpt_params & params);
 
 std::string gpt_params_get_system_info(const gpt_params & params);
 
+bool parse_cpu_range(const std::string& range, bool(&boolmask)[GGML_MAX_N_THREADS]);
+bool parse_cpu_mask(const std::string& mask, bool(&boolmask)[GGML_MAX_N_THREADS]);
+void postprocess_cpu_params(cpu_params& cpuparams, const cpu_params* role_model = nullptr);
+bool set_process_priority(enum ggml_sched_priority prio);
+
 //
 // String utils
 //
@@ -327,8 +342,9 @@ struct llama_init_result {
 
 struct llama_init_result llama_init_from_gpt_params(gpt_params & params);
 
-struct llama_model_params   llama_model_params_from_gpt_params (const gpt_params & params);
-struct llama_context_params llama_context_params_from_gpt_params(const gpt_params & params);
+struct llama_model_params     llama_model_params_from_gpt_params    (const gpt_params & params);
+struct llama_context_params   llama_context_params_from_gpt_params  (const gpt_params & params);
+struct ggml_threadpool_params ggml_threadpool_params_from_cpu_params(const cpu_params & params);
 
 struct llama_model * llama_load_model_from_url(const char * model_url, const char * path_model, const char * hf_token, const struct llama_model_params & params);
 struct llama_model * llama_load_model_from_hf(const char * repo, const char * file, const char * path_model, const struct llama_model_params & params);
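To see how the new cpu_params struct and helpers fit together, here is a hedged usage sketch. The mask string "0xFF" (a hex bitmask selecting CPUs 0-7), the thread count, and the interpretation of postprocess_cpu_params's role_model argument are assumptions for illustration, not defaults from this commit:

// Hypothetical usage of the helpers declared above.
#include "common.h"

static void example_configure(gpt_params & params) {
    bool mask[GGML_MAX_N_THREADS] = {false};

    if (parse_cpu_mask("0xFF", mask)) { // assumed to request CPUs 0-7
        for (int i = 0; i < GGML_MAX_N_THREADS; i++) {
            params.cpuparams.cpumask[i] = mask[i];
        }
        params.cpuparams.mask_valid = true;
    }

    params.cpuparams.n_threads = 8;

    // Assumed semantics: fill in unset batch params using the generation
    // params as the template ("role model").
    postprocess_cpu_params(params.cpuparams_batch, &params.cpuparams);
}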

examples/baby-llama/baby-llama.cpp (+1 −1)

@@ -18,7 +18,7 @@ constexpr float rms_norm_eps = 5e-6f;
 #endif
 
 static void ggml_graph_compute_helper(std::vector<uint8_t> & buf, ggml_cgraph * graph, int n_threads) {
-    struct ggml_cplan plan = ggml_graph_plan(graph, n_threads);
+    struct ggml_cplan plan = ggml_graph_plan(graph, n_threads, nullptr);
 
     if (plan.work_size > 0) {
         buf.resize(plan.work_size);
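The new third argument of ggml_graph_plan is the threadpool; nullptr (as here, and in benchmark-matmult below) keeps the old default behavior. When an explicit pool is wanted, the API introduced by this commit can be used instead. A hedged sketch follows; the function names match the commit message (ggml_threadpool_params_default, _new, _free), but treat the exact signatures as assumptions on other revisions:

// Sketch: computing a graph on an explicit threadpool instead of nullptr.
#include "ggml.h"

static void compute_with_pool(struct ggml_cgraph * graph, int n_threads) {
    struct ggml_threadpool_params tpp = ggml_threadpool_params_default(n_threads);
    struct ggml_threadpool *      tp  = ggml_threadpool_new(&tpp);

    struct ggml_cplan plan = ggml_graph_plan(graph, n_threads, tp);
    // ... allocate plan.work_data from plan.work_size, as in the helper above ...
    ggml_graph_compute(graph, &plan);

    ggml_threadpool_free(tp);
}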

examples/benchmark/benchmark-matmult.cpp (+2 −2)

@@ -21,7 +21,7 @@
 #endif
 
 static void ggml_graph_compute_helper(std::vector<uint8_t> & buf, ggml_cgraph * graph, int n_threads) {
-    struct ggml_cplan plan = ggml_graph_plan(graph, n_threads);
+    struct ggml_cplan plan = ggml_graph_plan(graph, n_threads, nullptr);
 
     if (plan.work_size > 0) {
         buf.resize(plan.work_size);
@@ -54,7 +54,7 @@ static void tensor_dump(const ggml_tensor * tensor, const char * name) {
 #define TENSOR_DUMP(tensor) tensor_dump(tensor, #tensor)
 
 struct benchmark_params_struct {
-    int32_t n_threads = 1;
+    int     n_threads = 1;
     int32_t n_iterations = 10;
 };

examples/cvector-generator/cvector-generator.cpp (+2 −2)

@@ -486,8 +486,8 @@ int main(int argc, char ** argv) {
     if (use_pca) {
         // run PCA
         PCA::pca_params pca_params;
-        pca_params.n_threads = params.n_threads;
-        pca_params.n_batch   = params.n_pca_batch;
+        pca_params.n_threads = params.cpuparams.n_threads;
+        pca_params.n_batch   = params.n_pca_batch;
         pca_params.n_iterations = params.n_pca_iterations;
         PCA::run_pca(pca_params, ctx_train.v_diff, ctx_train.v_final);
     } else {

examples/export-lora/export-lora.cpp (+1 −1)

@@ -410,7 +410,7 @@ int main(int argc, char ** argv) {
 
     g_verbose = (params.verbosity == 1);
     try {
-        lora_merge_ctx ctx(params.model, params.lora_adapters, params.lora_outfile, params.n_threads);
+        lora_merge_ctx ctx(params.model, params.lora_adapters, params.lora_outfile, params.cpuparams.n_threads);
         ctx.run_merge();
     } catch (const std::exception & err) {
         fprintf(stderr, "%s\n", err.what());
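Putting the pieces together, an application would map its cpu_params onto a ggml threadpool and hand that pool to the llama context. The sketch below is hedged: ggml_threadpool_params_from_cpu_params is declared in common/common.h above, but llama_attach_threadpool and its signature are assumptions based on this PR's "move all pause/resume logic into ggml" and set_threadpool changes:

// Hedged end-to-end sketch; llama_attach_threadpool is an assumed API.
#include "common.h"
#include "llama.h"

static void attach_pool(llama_context * ctx, const gpt_params & params) {
    struct ggml_threadpool_params tpp =
        ggml_threadpool_params_from_cpu_params(params.cpuparams);

    struct ggml_threadpool * tp = ggml_threadpool_new(&tpp);

    // Reuse one pool for both generation and batch work here; a second pool
    // could be created from params.cpuparams_batch instead.
    llama_attach_threadpool(ctx, tp, tp);
}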

0 commit comments
