where ``max_num_pages`` is the maximum number of pages used by all requests, ``page_size`` is the number of tokens
we fit into each page, and the ``2`` in the single-tensor storage stands for K and V (the first slot for keys, the second for values).
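As a concrete illustration of this layout, here is a minimal NumPy sketch of a single-tensor paged KV-Cache in ``NHD`` layout. The sizes (``max_num_pages``, ``num_heads``, ``head_dim``, etc.) are made-up example values, and a real deployment would allocate this as a GPU tensor (e.g. with PyTorch) rather than a NumPy array:

```python
import numpy as np

# Hypothetical sizes, for illustration only.
max_num_pages = 128   # maximum pages across all requests
page_size = 16        # tokens stored in each page
num_heads = 32
head_dim = 128

# Single-tensor storage, NHD layout:
#   (max_num_pages, 2, page_size, num_heads, head_dim)
# The "2" packs keys and values: index 0 = K, index 1 = V.
kv_cache = np.zeros(
    (max_num_pages, 2, page_size, num_heads, head_dim), dtype=np.float16
)

k_page0 = kv_cache[0, 0]  # keys of page 0, shape (page_size, num_heads, head_dim)
v_page0 = kv_cache[0, 1]  # values of page 0, same shape
print(k_page0.shape)  # (16, 32, 128)
```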
.. note::

    ``indptr`` arrays across the flashinfer library should be of type ``int32``. Arrays of type ``int64`` can cause indexing errors. This is also true of the ``kv_page_indices`` and ``kv_last_page_lens`` arrays.

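To make the dtype requirement concrete, here is a small sketch of building these arrays for two hypothetical requests (the page counts and last-page lengths are invented for illustration):

```python
import numpy as np

# Suppose two requests hold 3 and 2 pages respectively, with page_size = 16.
page_size = 16

# All index arrays are int32, as required, not the NumPy default int64.
kv_indptr = np.array([0, 3, 5], dtype=np.int32)             # request i owns pages
kv_page_indices = np.array([0, 1, 2, 3, 4], dtype=np.int32)  # indices [indptr[i]:indptr[i+1]]
kv_last_page_lens = np.array([7, 16], dtype=np.int32)        # valid tokens in each last page

# Tokens held by request i: (num_pages - 1) * page_size + last_page_len
num_pages = np.diff(kv_indptr)
num_tokens = (num_pages - 1) * page_size + kv_last_page_lens
print(num_tokens)  # [39 32]
```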
.. _mla-page-layout:
Multi-head Latent Attention Page Layout
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Multi-head Latent Attention (MLA) is an attention mechanism proposed in `DeepSeek v2 <https://arxiv.org/abs/2405.04434>`_ and
used in later DeepSeek models. MLA unifies the key cache and value cache into a single tensor, so there is no need to store them separately.
Compared to multi-head attention or grouped-query attention, the KV-Cache of MLA does not have the ``num_heads`` dimension,
so there is no distinction between the ``NHD`` and ``HND`` layouts.
MLA separates the RoPE (Rotary Position Embedding) dimensions from the other head dimensions. We use ``kpe`` (key with positional encoding) and ``ckv`` (compressed key/value)
to name these two components. Users can store them in a single Paged KV-Cache:
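A minimal NumPy sketch of one way such a single cache can be laid out, assuming the DeepSeek v2 dimensions (``ckv_dim = 512``, i.e. ``kv_lora_rank``, and ``kpe_dim = 64``, i.e. ``qk_rope_head_dim``); the exact layout expected by flashinfer's MLA kernels may differ:

```python
import numpy as np

# Example dimensions taken from DeepSeek v2; treat them as illustrative values.
ckv_dim = 512   # compressed key/value width (kv_lora_rank)
kpe_dim = 64    # key positional-encoding width (qk_rope_head_dim)
max_num_pages = 64
page_size = 16

# One paged cache holds both components per token. Note there is no
# num_heads dimension, so no NHD/HND distinction arises.
mla_cache = np.zeros(
    (max_num_pages, page_size, ckv_dim + kpe_dim), dtype=np.float16
)

ckv = mla_cache[..., :ckv_dim]   # compressed key/value part of each token
kpe = mla_cache[..., ckv_dim:]   # RoPE part of the key
print(ckv.shape, kpe.shape)  # (64, 16, 512) (64, 16, 64)
```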