WIP: Use DirectStorage with CUDA interop to more efficient load tensors #7796
base: master
Conversation
When I previously investigated this in PR #1483 I found that cuFile is not significantly faster, but it's always possible that my implementation was just bad. In any case, be aware that cuFile is incompatible with some filesystems; on my Linux machine I could, for instance, only load models stored on ext4 partitions but not models stored on Btrfs partitions.
Your PR doesn't state which NVMe devices you used for benchmarking with cuFile. On a modern x64 system a single CPU core can process at most ~25 GB/s with specialized memcpy implementations and most likely ~12 GB/s with libc's memcpy. The default NVMe->GPU path involves quite a few reads/writes through host memory, which limits performance if a single thread is used for IO:

- PCI -> host kernel space (write)
- kernel -> disk cache (rw)
- disk cache -> user space (rw)
- user space -> pinned memory (CUDA, rw)
- pinned memory -> GPU (read)

Assuming the DMA reads directly into the disk cache (or the disk cache is bypassed), the best case is 2x DMA plus 4x CPU read or write. Assuming the DMA is asynchronous and a ~12 GB/s memcpy implementation is used, one would get at most ~3 GB/s with a single thread, which is close to the 2.66 GB/s I've been seeing (40 GB in 15 s).

Instead of DirectStorage I could imagine another pipeline which uses multiple threads copying from mmapped memory to pinned memory and a single thread issuing the cudaMemcpy calls for the uploads. However, mmapped memory in its current form has another problem: it commits pages and thus increases physical memory utilization for tensors which are actually only required on the GPU. This could be solved easily by closing the mmap handles after uploading the data to the GPU.
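As a rough sanity check of that figure, using the numbers from this comment (four CPU-touching passes over the payload at a ~12 GB/s single-thread memcpy rate):

$$\text{effective bandwidth} \approx \frac{12\ \text{GB/s}}{4} = 3\ \text{GB/s},$$

which lines up with the observed ~2.66 GB/s.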
I think we should give it a try using multiple threads with direct I/O into a pinned buffer. I think that would remove all unnecessary copies. If the overall performance is similar, it would save us a significant amount of complexity from having to implement backend-specific interfaces.
This is done on POSIX systems when possible by calling […].
Have you measured how much performance is lost when not mapping the whole file at once, but instead having one mapping per tensor? This would allow unmapping tensors which are no longer required on the CPU.
I have not tried it, but we absolutely could do something like that. Mapping only the fraction of the file used on the CPU, and using direct I/O or DirectStorage/cuFile/etc. for the offloaded fraction, could be a good default behavior.
Wouldn't partial mapping require the same infrastructure as DirectStorage? Instead of relying on getting a host pointer in tensor_set, the backend would get a file reference and be responsible for mapping/unmapping data or using direct I/O as desired per tensor?
I don't think it would be exactly the same. Moving the mmap code into the backends would result in code duplication on systems with unified memory, because in that case mmap can be used for both the CPU and the GPU backends. Currently this is only Metal on Apple silicon, but we have experimented with supporting mmap with CUDA and HIP, and we tested that it does work on AMD iGPUs. I would also expect it to work on NVIDIA systems like Tegra.

I would also be wary about creating one mapping per tensor: even small models have hundreds of tensors, and larger models exceed a thousand. In most cases we should be able to consolidate the mappings into one or two mappings per backend. I would also expect to be able to use the same direct I/O implementation for all the backends, except maybe the pinned buffer allocation.
I was using a SanDisk Ultra 3D NVMe. |
That said, I wouldn't be against moving the entire loading logic, including mmap and possibly direct I/O, to ggml or ggml-backend. @ggerganov has disagreed in the past, but I think this should be dealt with by the library so that all applications can benefit from it (in this context, ggml is the library and llama.cpp is the application).
This disk can read ~3500 MB/s, which means perf improvements are not expected with DirectStorage.
On Windows it's legal to call MapViewOfFile twice on the same file mapping, and one gets two different pointers backed by the same physical pages. What I do not know yet is whether unmapping one view of the file will also free the physical backing store that was allocated for that view only.
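A minimal sketch of the double-mapping behavior described above (not from this PR; `model.gguf` is a placeholder path and error handling is omitted):

```cpp
#include <windows.h>
#include <cstdio>

int main() {
    HANDLE file    = CreateFileA("model.gguf", GENERIC_READ, FILE_SHARE_READ,
                                 nullptr, OPEN_EXISTING, FILE_ATTRIBUTE_NORMAL, nullptr);
    HANDLE mapping = CreateFileMappingA(file, nullptr, PAGE_READONLY, 0, 0, nullptr);

    // Two separate views -> two distinct virtual addresses, same physical backing store.
    void * view_a = MapViewOfFile(mapping, FILE_MAP_READ, 0, 0, 0);
    void * view_b = MapViewOfFile(mapping, FILE_MAP_READ, 0, 0, 0);
    printf("view_a=%p view_b=%p\n", view_a, view_b);

    // Open question from the comment above: does unmapping one view release the
    // physical pages that were faulted in through that view only?
    UnmapViewOfFile(view_a);
    UnmapViewOfFile(view_b);
    CloseHandle(mapping);
    CloseHandle(file);
}
```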
We can reconsider. The main reason to disagree in the beginning was that I wasn't familiar with […].

On the topic of this PR: I don't have much to add for now. I agree that the direct I/O approach suggested earlier should be explored, because it seems the implementation would carry less baggage (i.e. dependencies and headers) and would be more portable.
I wrote a tiny benchmark today to determine the real IO throughput of reading into system memory. I ran it on a system with 128 GB of memory and a Corsair T705 NVMe drive; the system is large enough to keep one 39 GB file completely cached, but not large enough to keep five files of that size cached. I ran two iterations of the benchmark: one where each run read a different file, to ensure the FS cache is not utilized, and one where all runs read the same file, with a single warmup iteration.

The result: std::fstream is pretty bad; fread is okay once the data is in the FS cache, but still only 55% as fast as using the Win32 API to read data which is not yet in the FS cache; mmap is slower than direct IO as well. Caching improves things a lot, yet unbuffered IO with the Win32 API on a fast NVMe drive is still the fastest option.

The outcome of this benchmark is that for a non-RAID NVMe drive, unbuffered file IO is already quite good if file IO into pinned memory and host->device transfers can be pipelined to overlap instead of running serially.
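The benchmark numbers themselves are not reproduced in this excerpt; the following is a minimal sketch of the unbuffered Win32 variant described as fastest (the file name and 1 MiB chunk size are placeholders, error handling omitted):

```cpp
#include <windows.h>
#include <malloc.h>
#include <chrono>
#include <cstdio>

int main() {
    const size_t chunk = 1u << 20;              // must be a multiple of the sector size
    void * buf = _aligned_malloc(chunk, 4096);  // FILE_FLAG_NO_BUFFERING needs sector-aligned buffers

    HANDLE f = CreateFileA("benchmark.bin", GENERIC_READ, FILE_SHARE_READ, nullptr,
                           OPEN_EXISTING, FILE_FLAG_NO_BUFFERING, nullptr);

    auto   t0    = std::chrono::steady_clock::now();
    size_t total = 0;
    DWORD  n     = 0;
    while (ReadFile(f, buf, (DWORD) chunk, &n, nullptr) && n > 0) {
        total += n;
    }
    double dt = std::chrono::duration<double>(std::chrono::steady_clock::now() - t0).count();
    printf("read %zu bytes at %.2f GB/s\n", total, total / dt / 1e9);

    CloseHandle(f);
    _aligned_free(buf);
    return 0;
}
```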
@slaren I've prototyped a small piece of code that takes the direct file IO path straight into a set of pinned memory buffers used in round-robin style, and achieved ~8.5 GB/s with 4 pinned memory buffers of 1 MB each. While this is still slower than using DirectStorage, it is a good intermediate step towards higher IO bandwidth and also prepares an API which allows the use of DirectStorage in the future. To achieve this bandwidth the file has to be opened with CreateFile(A) on Windows, while ggml_fopen currently supports only a C-style FILE*. My proposal for the API changes would be:
The following symbols can potentially be removed: llama_file, llama_mmap. For the async upload one would have to add […]. In the future, in case DirectStorage is implemented, one would potentially have to add […]. Do you like those API changes, or do you have suggestions on how to do this more efficiently within the ggml framework?
Besides being more efficient with regards to perf (measured on Windows), the other benefit of this change is that the issues with committed memory on Windows will be gone as well, since data is read only into temporary pinned memory buffers.
If I understand correctly, currently what you are doing is reading the file from a single thread with a loop similar to this:

```cpp
cudaEventSynchronize(buffer_event[i]);        // wait until the previous upload from this buffer is done
read_file(file, buffer[i], ..);               // read the next chunk into the pinned buffer
cudaMemcpyAsync(.., buffer[i], .., stream);   // enqueue the host->device copy
cudaEventRecord(buffer_event[i], stream);     // mark when this buffer becomes reusable
i = (i + 1) % n_bufs;
```

Is this correct? Would there be any advantage to using multiple threads?
That is correct. Here is the prototype I used for benchmarking; I haven't used multiple threads yet. I suggest delaying experiments with multiple threads until the basic algorithm is implemented. With multiple threads, questions arise such as: do we want one worker thread per GPU, or should the main thread flush the memcpys? If we have one thread per GPU, further questions follow: is it still necessary to call cudaSetDevice before every CUDA call? What is the overhead of putting kernel launches in a queue per thread vs. executing them on a local thread?
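The prototype attachment itself is not included in this excerpt. Below is a minimal sketch of the approach it describes, under stated assumptions: Windows unbuffered reads via CreateFileA/ReadFile, four 1 MiB pinned staging buffers used round-robin, and a single contiguous device destination; load_file_to_device and the error-free control flow are illustrative only.

```cpp
#include <windows.h>
#include <cuda_runtime.h>

// Sketch: stream a file into device memory through a small ring of pinned staging
// buffers, overlapping unbuffered disk reads with host->device copies on a CUDA stream.
static void load_file_to_device(const char * path, void * dst_device, cudaStream_t stream) {
    const int    n_bufs   = 4;
    const size_t buf_size = 1u << 20; // 1 MiB; a multiple of the sector size, as FILE_FLAG_NO_BUFFERING requires

    void *      bufs  [n_bufs];
    cudaEvent_t events[n_bufs];
    for (int i = 0; i < n_bufs; i++) {
        cudaHostAlloc(&bufs[i], buf_size, cudaHostAllocDefault); // pinned, page-aligned staging buffer
        cudaEventCreateWithFlags(&events[i], cudaEventDisableTiming);
        cudaEventRecord(events[i], stream); // so the first wait on each buffer returns immediately
    }

    HANDLE f = CreateFileA(path, GENERIC_READ, FILE_SHARE_READ, nullptr,
                           OPEN_EXISTING, FILE_FLAG_NO_BUFFERING, nullptr);

    size_t offset = 0;
    for (int i = 0; ; i = (i + 1) % n_bufs) {
        // make sure the previous async upload from this staging buffer has finished
        cudaEventSynchronize(events[i]);

        DWORD n = 0;
        if (!ReadFile(f, bufs[i], (DWORD) buf_size, &n, nullptr) || n == 0) {
            break; // EOF or error
        }
        cudaMemcpyAsync((char *) dst_device + offset, bufs[i], n, cudaMemcpyHostToDevice, stream);
        cudaEventRecord(events[i], stream);
        offset += n;
    }

    cudaStreamSynchronize(stream); // wait for the remaining uploads before freeing the ring
    CloseHandle(f);
    for (int i = 0; i < n_bufs; i++) {
        cudaEventDestroy(events[i]);
        cudaFreeHost(bufs[i]);
    }
}
```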
I think there are a few problems with the proposed API. It would require backends to implement a function that is not really necessary, since ggml-backend already exposes (almost) all the functionality needed to implement this. It is also not clear to me how to free the temporary buffers and events used during loading. ggml-backend is missing a generic way to allocate pinned buffers, but this is available in llama.cpp via calls to the specific implementations of each backend. I also think it would be preferable to have a higher-level API in ggml that can load multiple tensors in a single call and hide most of the details of loading tensors from gguf files, but it is not clear what that would look like.

My conclusion is that we need more time to figure out how to move this functionality to ggml. For the time being, this functionality could be implemented in llama.cpp instead. We would need to extend […]:

```cpp
// allocate resources
ggml_backend_buffer_t host_buffer = ggml_backend_buft_alloc_buffer(llama_default_buffer_type_cpu(true), size);
ggml_backend_t backend = ggml_backend_cuda_init(device);
ggml_backend_event_t event = ggml_backend_event_new(backend);
void * host_buffer_ptr = ggml_backend_buffer_get_base(host_buffer);

// copy loop
ggml_backend_event_synchronize(event);
file->read(host_buffer_ptr, ..);
ggml_backend_tensor_set_async(backend, tensor, host_buffer_ptr, ...);
ggml_backend_event_record(event);

// wait for all copies to finish
ggml_backend_synchronize(backend);

// free resources
ggml_backend_buffer_free(host_buffer);
ggml_backend_event_free(event);
ggml_backend_free(backend);
```

Currently we are missing a function in llama.cpp to initialize a […].
On Windows I've seen file IO in the range of 3-4 GB/s using a single IO thread and mmapped files. The newest NVMe drives can do >14 GB/s and good RAID controllers can read at up to ~55 GB/s. To get read speeds close to NVMe RAID speed without stressing CPU RAM bandwidth, DirectStorage can be used.
On Linux CUDA supports the cuFile API; on Windows one currently has to use DirectStorage for DX with CUDA interop, as done in this POC. I've seen speedups of 3x over mmap (15 s -> 5 s) when streaming from a single NVMe drive. There is a code path which can stream from two NVMe drives at once (absent a RAID), which improves the speedup even more, but it is currently limited by DX/CUDA interop limitations in combination with llama.cpp.
For now I have hijacked the non-mmapped code path and pass a struct carrying the file information to the tensor_set function. For a clean solution it would be good to have some abstract way to import tensor data from a ggml file handle, where the handle type depends on the backend. A special file handle is needed because the different DirectStorage APIs all have their own way of opening a file for DirectStorage operations.
The simplest interface one could imagine would be ggml_tensor_set(filename, offset, size). However, passing a filename would require opening the file on each set operation, which is potentially more expensive than the read itself. Thus my proposal is to have two new functions […]. Since IO ops are completely asynchronous, there must eventually be a way to synchronize all file IO, or at least to add an event to the file IO queue to ensure that all file IO is done. Currently the hack is that a nullptr passed as the file triggers this sync.
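As an illustration only, such an interface could look roughly like the following; every name here (ggml_backend_file_t, ggml_backend_tensor_set_from_file, ggml_backend_file_io_synchronize) is hypothetical and not part of ggml or of this PR:

```cpp
// Hypothetical sketch only; none of these declarations exist in ggml.
// A backend-specific file handle is opened once and reused for every tensor,
// since DirectStorage/cuFile each need the file opened in their own way.
typedef struct ggml_backend_file * ggml_backend_file_t;

ggml_backend_file_t ggml_backend_file_open (ggml_backend_t backend, const char * path);
void                ggml_backend_file_close(ggml_backend_file_t file);

// Enqueue an asynchronous read of [offset, offset + size) from the file
// directly into the tensor's backend buffer.
void ggml_backend_tensor_set_from_file(ggml_backend_t backend, struct ggml_tensor * tensor,
                                        ggml_backend_file_t file, size_t offset, size_t size);

// Block until every read enqueued so far has completed, replacing the current
// "pass a nullptr file to trigger the sync" hack.
void ggml_backend_file_io_synchronize(ggml_backend_t backend);
```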