[Backend][Common] some concept explaination #7911

airMeng · 2024-06-13T02:10:24Z

airMeng
Jun 13, 2024
Collaborator

Hi @slaren @ggerganov I have some questions for the current common backend and think here might be a good place to ask, helping more developers to avoid additional effort.

according to #7806 (comment)

reuse the extra of the parent tensor instead of allocating a new one

but from https://github.com/ggerganov/llama.cpp/blob/f578b86b2123d0f92afbaa98a031df4d4464e582/ggml-cuda.cu#L422-L425, there is no extras needed at all.

the extras are allocated in the buffer of the KV cache

so according to CUDA, extras is already allocated in the common part. Can you point out which line does this?

To prevent this it is necessary either to allocate the views in the compute buffer (what happened before), or simply avoid allocating extras for views, which is what this change does.

I think the views should be allocated in view_src. from https://github.com/ggerganov/llama.cpp/blob/master/ggml-cuda.cu I don't see any process to allocate, so it is maintained by the common, too?

according to #7436 (comment)

All uses of ggml_tensor::backend should be removed. The SYCL backend can assume that all tensors it receives are allocated in a SYCL buffer.

However, when I refactor SYCL backend, https://github.com/ggerganov/llama.cpp/pull/7710/files#r1637387064, some tensors will still be allocated on CPU so I must delete the assert then the performance is hurt greatly.

Answered by slaren

Jun 13, 2024

First, a brief description of what the extras actually are and what purpose they serve:

Extras are an object associated to every tensor that backends can use to store any information that they may need in addition to the buffer and data pointers. For example, the CUDA backend uses extras in the split buffers to store the pointers of each device where the data of the tensor is being stored.

Extras are set during the call to the init_tensor function of the buffer interface, and freed, or returned to the pool, during the call to the buffer reset function. There isn't an individual call to reset for every tensor allocated in this buffer, therefore the backend must keep track of all the extras…

View full answer

airMeng · 2024-06-13T02:11:03Z

airMeng
Jun 13, 2024
Collaborator Author

@luoyu-intel @NeoZhangJianyu @AidanBeltonS FYI

0 replies

slaren · 2024-06-13T13:22:51Z

slaren
Jun 13, 2024
Maintainer

First, a brief description of what the extras actually are and what purpose they serve:

Extras are an object associated to every tensor that backends can use to store any information that they may need in addition to the buffer and data pointers. For example, the CUDA backend uses extras in the split buffers to store the pointers of each device where the data of the tensor is being stored.

Extras are set during the call to the init_tensor function of the buffer interface, and freed, or returned to the pool, during the call to the buffer reset function. There isn't an individual call to reset for every tensor allocated in this buffer, therefore the backend must keep track of all the extras allocated within a buffer, and be ready to release them during the call to reset. Some backends, based on an early implementation of the CUDA backend, use a circular buffer with a fixed number of extras. This is unreliable and should be avoided.

Extras are completely optional and should be avoided if not strictly necessary. They add another resource that needs to be managed by the backends, and using an extra will prevent the backend from working remotely with the RPC backend, because the calls to init_tensor are skipped, since having to initialize thousands of tensors in each graph evaluation, each requiring a round trip to and from the server, would be impractical.

Now about the problem with views and extras:

Since the extras are managed by the buffer object, it raises the question of what buffer object should own the extras used in views during computation. Extras for views cannot be allocated from the buffer that owns the original tensor (ie. the view_src) because these buffers may be static and its tensors are never freed. Examples of this include views of weights and views of the KV cache tensors, the reset function of these buffers is not called after every evaluation because they contain tensors that have a lifetime that extends beyond a single evaluation. The previous workaround to deal with this was to initialize the views using the temporary compute buffer instead. This results in an inconsistency: the buffer attribute of ggml_tensor is meant to represent the buffer where the tensor is allocated, but the buffer object of views would not point to the buffer where the data of the tensor is actually allocated, but rather to an unrelated buffer used to store the intermediate results of the computation. Furthermore, there is no guarantee or requirement that the compute buffer is of the same type than the buffer used to create the base tensor in the first place. For example, CUDA split buffers are only used to allocate weights, but they are not used in the compute buffer. So we may end in a situation where it is impossible to initialize a view because we don't have a buffer of the correct type to initialize it in.

To address all of this, after #7640, views are initialized using the buffer of their parent tensor. Since the buffer of the parent tensor is not always reset, for all practical purposes this means that backends cannot allocate a new extra for views. Instead, backends will need to reuse the extra of the parent tensor for its views, or to avoid using extras entirely for views.

About the SYCL backend: the extras that are used in the SYCL backend were inherited from an early implementation of the CUDA backend that required extras even for tensors allocated in a single device. Since the SYCL backend does not support split buffers or tensor parallelism, these extras serve no purpose and should be removed entirely. Instead, the device address of the tensors can be obtained from the data attribute of the tensors, which is initialized by ggml-backend and ggml-alloc using the base address returned by the get_base function of the buffer.

About tensors allocated on the CPU: this simply does not happen anymore. All the tensors received by a backend are allocated in a buffer type that the backend explicitly reports as supported in the supports_buft function of the backend object. The backend attribute of ggml_tensor has been deprecated and it is not set in ggml-backend anymore. Backends that support multiple buffer types must use the buffer member of ggml_tensor to determine the storage location of tensors.

2 replies

lslusarczyk Apr 14, 2025

Dear @slaren ,

the buffer attribute of ggml_tensor is meant to represent the buffer where the tensor is allocated,

the device address of the tensors can be obtained from the data attribute of the tensors

Could you please explain what is the difference between data and buffer attribute? Above 2 definitions looks the same to me.

slaren Apr 14, 2025
Maintainer

buffer is a pointer to ggml_backend_buffer that contains the description of the buffer and the backend interface functions. data is effectively the address or offset within this buffer where the tensor is allocated.

airMeng · 2024-06-13T13:55:50Z

airMeng
Jun 13, 2024
Collaborator Author

To address all of this, after #7640, views are initialized using the buffer of their parent tensor

I have to say this is an extraordinary design, just like the first shock that I found GGML using the ne[0] for the most inner dimensions, instead the ne[-1] which is used more broadly. But quite smart solution. Thank you for your detailed instruction.

0 replies

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[Backend][Common] some concept explaination #7911

{{title}}

Replies: 3 comments 2 replies

{{title}}

{{title}}

{{editor}}'s edit

{{editor}}'s edit

{{title}}

{{title}}

{{title}}

Select a reply

[Backend][Common] some concept explaination #7911

airMeng Jun 13, 2024 Collaborator

Replies: 3 comments · 2 replies

airMeng Jun 13, 2024 Collaborator Author

slaren Jun 13, 2024 Maintainer

lslusarczyk Apr 14, 2025

slaren Apr 14, 2025 Maintainer

airMeng Jun 13, 2024 Collaborator Author

airMeng
Jun 13, 2024
Collaborator

Replies: 3 comments 2 replies

airMeng
Jun 13, 2024
Collaborator Author

slaren
Jun 13, 2024
Maintainer

slaren Apr 14, 2025
Maintainer

airMeng
Jun 13, 2024
Collaborator Author