-
Hi @slaren @ggerganov, I have some questions about the current common backend code, and this seems like a good place to ask so that other developers can avoid duplicated effort. According to #7806 (comment) extras would be needed, but from https://github.com/ggerganov/llama.cpp/blob/f578b86b2123d0f92afbaa98a031df4d4464e582/ggml-cuda.cu#L422-L425 it looks like no extras are needed at all.
So, judging from the CUDA backend, the extras are already allocated in the common code. Can you point out which line does this?
I also think views should be allocated in their `view_src`. From https://github.com/ggerganov/llama.cpp/blob/master/ggml-cuda.cu I don't see any code that allocates them, so are they maintained by the common code too? (According to #7436 (comment).)
However, when I refactor the SYCL backend (https://github.com/ggerganov/llama.cpp/pull/7710/files#r1637387064), some tensors are still allocated on the CPU, so I have to delete the assert, and that hurts performance greatly.
-
First, a brief description of what the extras actually are and what purpose they serve:

Extras are an object associated to every tensor that backends can use to store any information that they may need in addition to the `buffer` and `data` pointers. For example, the CUDA backend uses extras in the split buffers to store the pointers of each device where the data of the tensor is being stored.

Extras are set during the call to the `init_tensor` function of the buffer interface, and freed, or returned to the pool, during the call to the buffer `reset` function. There isn't an individual call to `reset` for every tensor allocated in this buffer, therefore the backend must keep track of all the extras it has allocated so that they can be freed together when the buffer is reset.

Extras are completely optional and should be avoided if not strictly necessary. They add another resource that needs to be managed by the backends, and using an extra will prevent the backend from working remotely with the RPC backend, because the calls to `init_tensor` cannot be forwarded to the remote machine.

Now about the problem with views and extras: since the extras are managed by the buffer object, it raises the question of what buffer object should own the extras used in views during computation. Extras for views cannot be allocated from the buffer that owns the original tensor (i.e. the `view_src` buffer), because that buffer may never be reset, so the extras could never be freed.

To address all of this, after #7640, views are initialized using the buffer of their parent tensor. Since the buffer of the parent tensor is not always reset, for all practical purposes this means that backends cannot allocate a new extra for views. Instead, backends will need to reuse the extra of the parent tensor for its views, or to avoid using extras entirely for views.

About the SYCL backend: the extras that are used in the SYCL backend were inherited from an early implementation of the CUDA backend that required extras even for tensors allocated in a single device. Since the SYCL backend does not support split buffers or tensor parallelism, these extras serve no purpose and should be removed entirely. Instead, the device address of the tensors can be obtained from the `data` pointer of the tensor.

About tensors allocated on the CPU: this simply does not happen anymore. All the tensors received by a backend are allocated in a buffer type that the backend explicitly reports as supported in the `supports_buft` function.
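To make the extra lifecycle and the view rule concrete, here is a minimal sketch. It is not the actual ggml-backend interface; `my_tensor_extra`, `my_buffer_context`, and its member functions are hypothetical names, and only the `extra` and `view_src` fields of `ggml_tensor` come from the real API:

```cpp
// Hypothetical sketch of per-buffer extra management.
#include <unordered_set>
#include "ggml.h"

struct my_tensor_extra {
    void * device_ptr = nullptr; // whatever backend-specific state is needed
};

struct my_buffer_context {
    std::unordered_set<my_tensor_extra *> extras; // every extra owned by this buffer

    // would be called from the buffer interface's init_tensor
    void init_tensor(ggml_tensor * t) {
        if (t->view_src != nullptr) {
            // views reuse the extra of their parent tensor: after #7640 they
            // are initialized with the parent's buffer, which may never be
            // reset, so allocating a new extra here would leak
            t->extra = t->view_src->extra;
            return;
        }
        my_tensor_extra * e = new my_tensor_extra();
        extras.insert(e); // track it: there is no per-tensor free call
        t->extra = e;
    }

    // would be called from the buffer interface's reset: frees all extras at once
    void reset() {
        for (my_tensor_extra * e : extras) {
            delete e;
        }
        extras.clear();
    }
};
```

Tracking the extras in a container owned by the buffer is what makes the single `reset` call sufficient: nothing per-tensor needs to be freed individually.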
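And a sketch of the last point, assuming a `supports_buft` callback in the backend interface; `my_backend_buffer_type` is a hypothetical helper, while `ggml_backend_buft_is_host` is a real ggml-backend function:

```cpp
// Hypothetical sketch: a backend only accepts tensors from buffer types it
// reports as supported, so CPU-allocated tensors never reach it unexpectedly.
#include "ggml-backend.h"

// hypothetical: returns this backend's own device buffer type
extern ggml_backend_buffer_type_t my_backend_buffer_type(void);

static bool my_backend_supports_buft(ggml_backend_t backend, ggml_backend_buffer_type_t buft) {
    (void) backend;
    // accept our own device buffers, plus host (pinned) buffers if desired
    return buft == my_backend_buffer_type() || ggml_backend_buft_is_host(buft);
}
```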
-
I have to say this is an extraordinary design, much like my first surprise at finding that GGML uses ne[0] for the innermost dimension, instead of the ne[-1] convention that is used more broadly. A quite smart solution. Thank you for the detailed explanation.
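(For readers unfamiliar with the convention mentioned above, a minimal illustration, not part of the original reply: in ggml, `ne[0]` is the innermost, contiguous dimension.)

```cpp
// Sketch of ggml's dimension ordering: ne[0] is the innermost (contiguous)
// dimension, i.e. the row length of a matrix.
#include "ggml.h"

int main() {
    ggml_init_params params = { 16*1024*1024, nullptr, false };
    ggml_context * ctx = ggml_init(params);

    // a 4-row x 3-column matrix: ne[0] = 3 (columns, innermost), ne[1] = 4 (rows)
    ggml_tensor * t = ggml_new_tensor_2d(ctx, GGML_TYPE_F32, 3, 4);
    // t->ne == {3, 4, 1, 1}; t->nb[0] == sizeof(float) for the contiguous dim

    ggml_free(ctx);
    return 0;
}
```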