Investigate storing results from ggml operations in F16 format

Currently, all `ggml` operations return the results in F32 format.

The goal of this task is to see if there is an elegant way to add support for keeping the results in F16 format.
Ideally, this would be controlled by a parameter passed to the `ggml_context`. It will also involve adding support for F16 operands in most of the existing operators. We want to achieve this without duplicating the entire code base.
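To make the idea concrete, here is a minimal sketch of a context-level result type. Everything in it (`sketch_type`, `sketch_context`, `sketch_result_nbytes`) is a hypothetical name made up for illustration, not the existing `ggml` API. It only shows how a single field on the context could decide the element size of every result tensor:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical names throughout: none of this is the actual ggml API. */
enum sketch_type { SKETCH_TYPE_F32, SKETCH_TYPE_F16 };

struct sketch_context {
    enum sketch_type result_type; /* type used for the dst tensor of every op */
};

/* Size in bytes of one element of the given type. */
static size_t sketch_type_size(enum sketch_type t) {
    return t == SKETCH_TYPE_F16 ? 2 : 4;
}

/* Bytes needed for a result tensor with n elements created in this context. */
static size_t sketch_result_nbytes(const struct sketch_context *ctx, size_t n) {
    assert(ctx != NULL);
    return n * sketch_type_size(ctx->result_type);
}
```

With this layout, switching the context from F32 to F16 halves the storage needed for each intermediate result, and switching back restores full precision without touching the operators' call sites.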

Note that internal floating-point accumulators in the different operations can and should remain in F32 format.
Only when we store the results into the `dst` tensor do we cast them to F16.
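A rough sketch of that pattern, using a matrix-vector product as the example operation. The names `sketch_fp16`, `sketch_f32_to_f16` and `sketch_mul_mat_vec_f16_dst` are hypothetical, and the conversion routine is a plain software implementation that assumes IEEE 754 floats and the default rounding mode. It is not the conversion `ggml` itself uses:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

typedef uint16_t sketch_fp16; /* raw IEEE 754 binary16 bits (hypothetical name) */

/* F32 -> F16 with round-to-nearest-even. */
static sketch_fp16 sketch_f32_to_f16(float f) {
    uint32_t x;
    memcpy(&x, &f, sizeof x);
    const uint32_t sign = (x >> 16) & 0x8000u;
    const uint32_t mag  = x & 0x7FFFFFFFu;

    if (mag > 0x7F800000u) {
        return (sketch_fp16)(sign | 0x7E00u); /* NaN */
    }
    if (mag < 0x38800000u) {
        /* |f| < 2^-14: the result is an F16 subnormal or zero.
           Adding 0.5f rounds |f| to a multiple of 2^-24 (the float ulp in
           [0.5, 1)), which is exactly the F16 subnormal step. */
        float a;
        const float half = 0.5f;
        uint32_t ab, hb;
        memcpy(&a, &mag, sizeof a);
        a += half;
        memcpy(&ab, &a, sizeof ab);
        memcpy(&hb, &half, sizeof hb);
        return (sketch_fp16)(sign | (ab - hb));
    }
    /* Normal range: rebias the exponent (127 -> 15), round mantissa to 10 bits. */
    uint32_t r = mag - 0x38000000u;
    r += 0x0FFFu + ((r >> 13) & 1u);
    r >>= 13;
    if (r > 0x7C00u) {
        r = 0x7C00u; /* too large for F16 (or already infinite): infinity */
    }
    return (sketch_fp16)(sign | r);
}

/* dst[i] = dot(row i of a, b), where a is nrows x ncols in row-major F32. */
static void sketch_mul_mat_vec_f16_dst(const float *a, const float *b,
                                       sketch_fp16 *dst,
                                       size_t nrows, size_t ncols) {
    assert(a != NULL && b != NULL && dst != NULL);
    for (size_t i = 0; i < nrows; ++i) {
        float acc = 0.0f;                /* accumulator stays in F32 */
        for (size_t j = 0; j < ncols; ++j) {
            acc += a[i * ncols + j] * b[j];
        }
        dst[i] = sketch_f32_to_f16(acc); /* single cast when storing into dst */
    }
}
```

The rounding error is introduced once per output element rather than once per multiply-add, which is why keeping the accumulator in F32 matters.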

Going to F16 intermediate results would significantly reduce memory pressure and could lead to significant speed improvements. Hopefully, the loss in quality would be marginal. In any case, there will always be the option of switching back to full F32 precision.

I am looking for suggestions and initial prototypes of how we can achieve this in an elegant way.

Related:

- #909 
- #951 

Edit: An initial quick-and-dirty implementation that simply goes over the existing LLaMA-related operators and changes the return type to F16 would help determine whether such functionality is useful and how much performance gain we can expect. If it is worth it, we can then think in more detail about how exactly to support it.
