convert : write tensors in parallel #12837
base: master
Conversation
Yes, I suspect that will be the case. To test, you can decrease the threshold.

I can't think of any good strategy to make this better; probably just remove the multithreading in my download logic. WDYT?
I think it can still be relevant to multithread the download, especially in the case where individual download streams are slow and the biggest tensor ends up being the last to be written in full because of that. Having the download thread pool shared across downloads instead of per download might help with that, although it might require having a max range size per unit of work done by the download thread pool (but that may also cause way too many requests per tensor if that max size doesn't also consider the size of the tensors). I'm not sure what the best way to schedule those threads is; maybe it's fine to keep it per download like you did.

The decision of how many download threads to use mostly depends on the network bandwidth per stream. There might be a way to detect when it's not faster to use more than N download streams.
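To illustrate the idea of a shared download pool with a max range size per unit of work, here's a rough Python sketch. It is not code from either PR; `fetch_range`, `download_tensor`, and the 16 MiB cap are made up for the example:

```python
import requests
from concurrent.futures import ThreadPoolExecutor

MAX_RANGE_SIZE = 16 * 1024 * 1024  # cap per unit of work (illustrative value)

def fetch_range(url: str, start: int, size: int) -> bytes:
    # HTTP range request for [start, start + size)
    resp = requests.get(url, headers={"Range": f"bytes={start}-{start + size - 1}"})
    resp.raise_for_status()
    return resp.content

def download_tensor(pool: ThreadPoolExecutor, url: str, offset: int, size: int) -> bytes:
    # Split one tensor into bounded chunks and fetch them on the shared pool,
    # so a single huge tensor doesn't end up bottlenecked on one slow stream.
    futures = []
    start = offset
    while start < offset + size:
        length = min(MAX_RANGE_SIZE, offset + size - start)
        futures.append(pool.submit(fetch_range, url, start, length))
        start += length
    return b"".join(f.result() for f in futures)
```

The downside mentioned above still applies: a small cap means many requests per tensor unless the chunk size also scales with the tensor size.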
I've thought about some situations regarding multi-threaded download as in #12820. The numbered top-level statements are the two situations I'm considering, and they each get split further into two sub-considerations: one with a single download thread per tensor, and the other with multiple threads per download. This all assumes that there are multiple tensor-writing threads.
So after writing the above, I think having a single download thread per write thread should be simpler and sufficient, at least when writing the tensors can be multi-threaded (as introduced here).
Yes, I agree. I think the multithreading is only relevant to models with many big tensors, which is the case for all models > 3B. For smaller models, having it or not won't change much (the time it takes to finish still changes, but 20s vs 30s is not a big difference in this context).

I assume that for very big models like Scout or Maverick, whether the multithreading is per tensor or per write, the time will be about the same, because most tensors are very big (on each layer, the expert FFN tensor alone is 21GB).

I'm reverting my change in the other PR.
I think the main difference in time would be for the last tensor, which is more likely to be big for very big models. The worst case with a single download thread per tensor is when every other tensor has finished downloading and the last one is huge.
Should I merge this now? (Given that you still have some TODOs unchecked)
@ngxson I've tested more cases, and it still seems to give resulting files identical to the single-threaded conversion.

I think I'll keep the default at 2 threads for now, because the convert script is still mostly I/O bound, and this at least allows interleaving some work when busy reshaping and/or quantizing/converting the types of the tensors (and also the GIL limits some of the performance benefits of higher thread counts).
Note that I did not yet test on a slow HDD, only an external SSD. I don't know to what extent slower drives are affected by read/write contention from multiple threads. This is probably relevant to test with different numbers of threads. |
Yeah, that's a good question. I suspect that will vary depending on the OS and hardware config. But anyway, I think most modern OSes will write to cache first, so the performance difference between multiple and single-threaded writing won't be significant.

Also, I'm pretty sure that the write-to-file-descriptor function in Python is single-threaded.

Edit: no, it's not single-threaded; people still need to use a Lock.
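For reference, a minimal sketch (not the PR's code) of the two safe options when several threads write to the same file at different offsets: a lock around the `seek()`+`write()` pair, or `os.pwrite()` with an explicit offset where available:

```python
import os
import threading

write_lock = threading.Lock()

def write_at_locked(f, data: bytes, offset: int) -> None:
    # seek() and write() share the file position, so the pair must be atomic
    # when several threads use the same file object.
    with write_lock:
        f.seek(offset)
        f.write(data)

def write_at(fd: int, data: bytes, offset: int) -> None:
    # os.pwrite() takes an explicit offset and doesn't touch the shared file
    # position, so no lock is needed (POSIX; availability varies by platform).
    os.pwrite(fd, data, offset)
```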
@ngxson I've tried this with the remote lazy tensors from #12820, and when there's a connection error the thread which does the download raises an exception, but that doesn't stop the program, so it continues converting until it can't, then hangs (since not all tensors were written and the queue is not empty). Retries will need to be handled in the download code. I'll also rewrite the thread pool part.
- gguf-py : handle (limited) retries for remote tensors
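As a rough sketch of what limited retries and error propagation could look like (not the eventual implementation; `fetch_with_retries`, `run_all`, `MAX_RETRIES`, and the backoff are illustrative), the key point is that with `concurrent.futures`, calling `.result()` re-raises worker exceptions in the main thread, so a failed download aborts the conversion instead of leaving the queue waiting forever:

```python
import time
from concurrent.futures import ThreadPoolExecutor, as_completed

MAX_RETRIES = 3  # illustrative retry count

def fetch_with_retries(fetch, *args):
    # Retry a flaky network fetch a bounded number of times before giving up.
    for attempt in range(MAX_RETRIES):
        try:
            return fetch(*args)
        except ConnectionError:
            if attempt == MAX_RETRIES - 1:
                raise
            time.sleep(2 ** attempt)  # simple exponential backoff

def run_all(pool: ThreadPoolExecutor, jobs):
    futures = [pool.submit(fetch_with_retries, *job) for job in jobs]
    for fut in as_completed(futures):
        fut.result()  # re-raises any worker exception in the main thread
```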
This makes `GGUFWriter.write_tensors_to_file` use a thread pool to write the tensors in parallel. This should help make conversion faster, since otherwise nothing else happened when reading, processing or writing the tensors. Now that can all happen at once.
`convert_hf_to_gguf.py` has the new argument `--threads` (or `-t`, for short) to specify how many threads to use when writing the tensors. A queue of all the tensors, their offset and their respective file (so that sharding should work) is used to distribute the work in a small thread pool.
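A minimal sketch of that queue-plus-thread-pool pattern (not the actual `GGUFWriter` code; the work-item layout and the numpy-style `tobytes()` call are assumptions for the example):

```python
import queue
import threading

def write_tensors_parallel(work_items, num_threads: int = 2):
    # Each work item is (file object, byte offset, tensor); a small pool of
    # threads drains the queue and writes each tensor at its precomputed offset.
    q: queue.Queue = queue.Queue()
    for item in work_items:
        q.put(item)

    lock = threading.Lock()  # guard seek()+write() on shared file objects

    def worker():
        while True:
            try:
                f, offset, tensor = q.get_nowait()
            except queue.Empty:
                return
            data = tensor.tobytes()  # materializing/converting can run in parallel
            with lock:               # only the seek()+write() pair is serialized
                f.seek(offset)
                f.write(data)
            q.task_done()

    threads = [threading.Thread(target=worker) for _ in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
```

From the command line this would be exercised with something like `python convert_hf_to_gguf.py <model_dir> -t 2` (2 being the default discussed above).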
This was inspired by the discussion with @ngxson in #12820 (comment).
@ngxson I'm not sure how this will interact with multithreaded remote tensor fetching in #12820; there may be a lot of threads at once.
This should be compatible with #12820, although it will conflict because of the added arguments to the `Model` initializer in `convert_hf_to_gguf.py` (which is easy to resolve).

TODO
Make sure to read the contributing guidelines before submitting a PR