
Running LLaMA v2 with chat format. #507


Closed
viniciusarruda opened this issue Jul 19, 2023 · 16 comments
Labels
bug Something isn't working

Comments

@viniciusarruda
Contributor

viniciusarruda commented Jul 19, 2023

Prerequisites

  • I am running the latest code. Development is very rapid so there are no tagged versions as of now.
  • I carefully followed the README.md.
  • I searched using keywords relevant to my issue to make sure that I am creating a new issue that is not already open (or closed).
  • I reviewed the Discussions, and have a new bug or useful enhancement to share.

Expected Behavior

Generate all tokens without any error.

Current Behavior

Note: I have omitted my file path.

llama.cpp: loading model from models\llama-2-7b-chat.ggmlv3.q2_K.bin
llama_model_load_internal: format     = ggjt v3 (latest)
llama_model_load_internal: n_vocab    = 32000
llama_model_load_internal: n_ctx      = 512
llama_model_load_internal: n_embd     = 4096
llama_model_load_internal: n_mult     = 256
llama_model_load_internal: n_head     = 32
llama_model_load_internal: n_layer    = 32
llama_model_load_internal: n_rot      = 128
llama_model_load_internal: freq_base  = 10000.0
llama_model_load_internal: freq_scale = 1
llama_model_load_internal: ftype      = 10 (mostly Q2_K)
llama_model_load_internal: n_ff       = 11008
llama_model_load_internal: model size = 7B
llama_model_load_internal: ggml ctx size =    0.08 MB
llama_model_load_internal: mem required  = 4303.65 MB (+ 1026.00 MB per state)
llama_new_context_with_model: kv self size  =  256.00 MB
AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | VSX = 0 |
  Title: The Beauty and Wonder of Hawaii

Introduction:
Hawaii, the most remote island chain in the world, is a place like no other. Located in the Pacific Ocean, it's a group of eight major islands that boast of stunning natural beauty, rich cultural heritage, and a unique history. From the rugged mountains to the sun-kissed beaches, Hawaii is a true paradise on earth.

Page 1: Overview of the Islands
Hawaii consists of eight main islands - Niihau, Kahoolawe, Lanai, Maui, Oahu, Molokai, and the Big Island. Each island has its unique charm and attractions. For instance, the Big Island is home to two active volcanoes - Mauna Kea and Haleakala, while Maui is famous for its stunning beaches and the Road to Hana. Oahu, the most populous island, houses the capital city Honolulu and is known for its vibrant culture.

Page 2: Natural Wonders
Hawaii's natural beauty is among the most impressive in the world. The Big Island boasts of two active volcanoes - Mauna Kea and Haleakala, while Maui is home to stunning beaches like Makena and Waileaks. Oahu's Manoa Falls is a popular spot for nature lovers, while Lanai's Polihua Beach offers a serene retreat from the hustle-bustle of life. The island chain
Traceback (most recent call last):
  File "\test_chat_format.py", line 91, in <module>
    for token in completion:
  File "venv\lib\site-packages\llama_cpp\llama.py", line 713, in generate
    self.eval(tokens)
  File "venv\lib\site-packages\llama_cpp\llama.py", line 470, in eval
    self.scores[self.n_tokens + offset : self.n_tokens + n_tokens, :].reshape(
ValueError: could not broadcast input array from shape (32000,) into shape (0,)

Environment and Context

Running on Windows (PowerShell).
Python version: 3.10.9

Failure Information

When the error happens, the state of the variables at line 470 of llama.py is as shown in the following image:

[image: debugger view of the variables at llama.py line 470]

which results in a zero-range index for the assignment.
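
The failure can be reproduced in isolation with a short sketch (illustrative only; the names mimic the preallocated scores buffer that Llama.eval writes into, and the exact internals may differ):

```python
import numpy as np

# Illustrative sketch: write a logits row into a preallocated (n_ctx, n_vocab)
# scores buffer after the context window is already full.
n_ctx, n_vocab = 512, 32000
scores = np.zeros((n_ctx, n_vocab), dtype=np.single)

n_tokens = 512  # context already full
logits = np.zeros(n_vocab, dtype=np.single)

# The row slice past the end of the buffer is empty, so the assignment fails:
scores[n_tokens : n_tokens + 1, :].reshape(-1)[:] = logits
# ValueError: could not broadcast input array from shape (32000,) into shape (0,)
```

In other words, the prompt plus the generated tokens have exceeded n_ctx (512 here), so there is no room left in the buffer for the new logits.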

Steps to Reproduce

1. Download the llama-2-7b-chat.ggmlv3.q2_K.bin model here
2. Run the following code:

Note: The code is essentially this. I'm trying to use the correct chat format, based on some of the discussions around it: [1, 2, 3]

from llama_cpp import Llama
import os
from typing_extensions import TypedDict, Literal
from typing import List, Optional

Role = Literal["system", "user", "assistant"]


class Message(TypedDict):
    role: Role
    content: str


B_INST, E_INST = "[INST]", "[/INST]"
B_SYS, E_SYS = "<<SYS>>\n", "\n<</SYS>>\n\n"
DEFAULT_SYSTEM_PROMPT = """\
You are a helpful, respectful and honest assistant. Always answer as helpfully as possible, while being safe.  Your answers should not include any harmful, unethical, racist, sexist, toxic, dangerous, or illegal content. Please ensure that your responses are socially unbiased and positive in nature.

If a question does not make any sense, or is not factually coherent, explain why instead of answering something not correct. If you don't know the answer to a question, please don't share false information."""


def make_prompt_llama2(llm, messages: List[Message]) -> List[int]:
    # Mirrors Meta's reference chat format: the system prompt is folded into the
    # first user turn, and each completed user/assistant exchange ends with EOS.
    if messages[0]["role"] != "system":
        messages = [
            {
                "role": "system",
                "content": DEFAULT_SYSTEM_PROMPT,
            }
        ] + messages

    messages = [
        {
            "role": messages[1]["role"],
            "content": B_SYS + messages[0]["content"] + E_SYS + messages[1]["content"],
        }
    ] + messages[2:]

    assert all([msg["role"] == "user" for msg in messages[::2]]) and all(
        [msg["role"] == "assistant" for msg in messages[1::2]]
    ), (
        "model only supports 'system', 'user' and 'assistant' roles, "
        "starting with 'system', then 'user' and alternating (u/a/u/a/u...)"
    )

    # Tokenize each completed (user, assistant) pair with a leading BOS, append EOS,
    # and concatenate all turns into a single token list.
    dialog_tokens = sum(
        [
            llm.tokenize(
                bytes(
                    f"{B_INST} {(prompt['content']).strip()} {E_INST} {(answer['content']).strip()} ",
                    "utf-8",
                ),
                add_bos=True,
            )
            + [llm.token_eos()]
            for prompt, answer in zip(
                messages[::2],
                messages[1::2],
            )
        ],
        [],
    )

    assert messages[-1]["role"] == "user", f"Last message must be from user, got {messages[-1]['role']}"

    # The final user message gets no trailing EOS so the model will generate the answer.
    dialog_tokens += llm.tokenize(
        bytes(f"{B_INST} {(messages[-1]['content']).strip()} {E_INST}", "utf-8"),
        add_bos=True,
    )

    return dialog_tokens


if __name__ == "__main__":
    llm = Llama(model_path=os.path.join("models", "llama-2-7b-chat.ggmlv3.q2_K.bin"))

    messages: List[Message] = [
        Message(role="user", content="How are you?"),
        Message(role="assistant", content="I'm fine!"),
        Message(role="user", content="Write a four page long essay about Hawaii."),
    ]

    tokens = make_prompt_llama2(
        llm,
        messages,
    )

    completion = llm.generate(
        tokens=tokens,
    )

    for token in completion:
        if token == llm.token_eos():
            break
        print(llm.detokenize([token]).decode("utf-8"), end="", flush=True)
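
For reference, the token sequence built by make_prompt_llama2 for the three example messages corresponds roughly to the following layout, where <BOS> and <EOS> stand for the special tokens added by tokenize(add_bos=True) and llm.token_eos() (a sketch, not literal output):

```
<BOS>[INST] <<SYS>>
{DEFAULT_SYSTEM_PROMPT}
<</SYS>>

How are you? [/INST] I'm fine! <EOS><BOS>[INST] Write a four page long essay about Hawaii. [/INST]
```
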
@viniciusarruda
Contributor Author

For those interested, I've created this repo to handle LLaMA v2 chat completion. However, I still need to solve this issue.

@abetlen
Owner

abetlen commented Jul 20, 2023

@viniciusarruda thank you for reporting this issue and setting up that repo. I'm working on making the chat completion formatting configurable and will add that as an option.

@viniciusarruda
Contributor Author

viniciusarruda commented Jul 20, 2023

Nice! I'm still trying to solve the issue presented here. I'm getting a different token when comparing the original Meta tokenizer and GGML. I think it is not related to this repo, since I'm calling the low-level API and the result is the same.
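
A minimal way to compare the two tokenizers looks roughly like this (a sketch: it assumes sentencepiece is installed and that Meta's original tokenizer.model file is available locally; the paths are illustrative):

```python
from llama_cpp import Llama
from sentencepiece import SentencePieceProcessor

text = "[INST] How are you? [/INST]"

# Meta's original SentencePiece tokenizer
sp = SentencePieceProcessor(model_file="tokenizer.model")
meta_ids = [sp.bos_id()] + sp.encode(text)

# llama.cpp / GGML tokenizer via llama-cpp-python
llm = Llama(model_path="models/llama-2-7b-chat.ggmlv3.q2_K.bin")
ggml_ids = llm.tokenize(text.encode("utf-8"), add_bos=True)

print("meta:", meta_ids)
print("ggml:", ggml_ids)
```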

@burkaygur

With this formatting, do you recommend the chat version or the regular version of Llama 2?

@BartzLeon

Hey, so I get the same error. Is this maybe related?

from langchain import LlamaCpp
from langchain.document_loaders import DirectoryLoader, TextLoader

local_model_path = '../llama/llama.cpp/models/13B/ggml-model-q4_0.bin'
loader = TextLoader('example.txt')

from langchain.embeddings import LlamaCppEmbeddings
llama_llm = LlamaCpp(model_path=local_model_path, n_ctx=2048, max_tokens=50)

from langchain.indexes import VectorstoreIndexCreator
llama_emb = LlamaCppEmbeddings(model_path=local_model_path)
index = VectorstoreIndexCreator(embedding=llama_emb).from_loaders([loader])

q = "What is the context about?"
print(index.query(q, llm=llama_llm))

When example.txt gets too big, I get:
ValueError: could not broadcast input array from shape (8,) into shape (0,)
Do you know how I could solve this?
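
One thing that may help (a sketch, not verified against this exact setup): split the document into chunks that fit comfortably inside the embedding model's context window before indexing, for example with a standard LangChain text splitter. The chunk sizes below are illustrative:

```python
from langchain.text_splitter import RecursiveCharacterTextSplitter

# Keep each embedded chunk well below the model's context window (n_ctx).
splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50)

index = VectorstoreIndexCreator(
    embedding=llama_emb,
    text_splitter=splitter,
).from_loaders([loader])
```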

@ffernn-dev

> @viniciusarruda thank you for reporting this issue and setting up that repo. I'm working on making the chat completion formatting configurable and will add that as an option.

Ah thank you so much! I appreciate your work a ton, it's made my life so much easier <3

@viniciusarruda
Contributor Author

Since this is related to tokenization, I'll link a llama.cpp issue I found that could be relevant: ggml-org/llama.cpp#2310

@viniciusarruda
Contributor Author

To avoid this error, I need to set max_tokens so the output is truncated. I don't know why, but that is what I found.

So, in order to correctly use everything that is implemented, it is best to use the Llama.__call__ method, right @abetlen?
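
For reference, a manual cap along these lines avoids the error with the low-level generate loop (a sketch; it assumes llm and tokens are built as in the script above, and that llm.n_ctx() returns the context size the model was loaded with):

```python
# Stop before prompt tokens + generated tokens exceed the context window.
budget = llm.n_ctx() - len(tokens)

generated = 0
for token in llm.generate(tokens=tokens):
    if token == llm.token_eos() or generated >= budget:
        break
    print(llm.detokenize([token]).decode("utf-8", errors="ignore"), end="", flush=True)
    generated += 1
```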

@viniciusarruda
Contributor Author

Sorry, I was checking and it seems that is not possible. To include the LLaMA v2 chat completion functionality in this repo, the original llama.cpp tokenize function needs the ability to append the EOS token when formatting the messages. Then, in this repo, the tokenize call needs to handle the chat format by tokenizing each turn and setting EOS to True when needed.

Also, since this format is only valid for LLaMA 2, this repo needs some kind of flag to select the correct chat completion format.

@gjmulder added the bug (Something isn't working) label on Jul 30, 2023
@ffernn-dev

Any news on this? I've managed to get something working with a sliding context window, but I'm still getting this error once it goes out of context range.
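
For anyone trying the same approach, the sliding window can be sketched as dropping the oldest prompt tokens so that the prompt plus a generation budget always fits inside n_ctx (the numbers are illustrative, and this naive version does not preserve the system prompt):

```python
max_new_tokens = 256
window = llm.n_ctx() - max_new_tokens

# Keep only the most recent tokens before calling generate().
if len(tokens) > window:
    tokens = tokens[-window:]
```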

@phamkhactu

@viniciusarruda, @ffernn-dev Could you share some code showing how you fixed it? I have set truncate and max_tokens, but it is not working.

@qdore

qdore commented Aug 29, 2023

I also encountered this problem.

@ffernn-dev

> @viniciusarruda, @ffernn-dev Could you share some code showing how you fixed it? I have set truncate and max_tokens, but it is not working.

Hey, sorry, I'm in the middle of exam season at the moment, so I can't throw together a minimal reproduction, but the code at the top of this issue should do it, I think?

@abetlen
Owner

abetlen commented Aug 29, 2023

@viniciusarruda sorry for the delay; I will start working on this. I needed to first merge in the gguf changes to get an accurate model description, so I should now be able to detect the model format and implement a fix.

@sofianhw

> @viniciusarruda sorry for the delay; I will start working on this. I needed to first merge in the gguf changes to get an accurate model description, so I should now be able to detect the model format and implement a fix.

Thanks for the update! I appreciate the effort you're putting into getting everything set up correctly with the gguf changes. Can you confirm whether the llama-2 chat completion format is still in progress? I'm looking forward to it. Thanks!

@abetlen
Owner

abetlen commented Nov 8, 2023

This should be solved now with the Llama class chat_format parameter which can be used to select from common chat formats (including llama-2) and can be extended by hand using the chat_handler parameter.
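
For example (a minimal usage sketch; the model path is illustrative and assumes a gguf conversion of the chat model):

```python
from llama_cpp import Llama

llm = Llama(
    model_path="models/llama-2-7b-chat.Q4_K_M.gguf",
    chat_format="llama-2",
)

response = llm.create_chat_completion(
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "How are you?"},
    ],
)
print(response["choices"][0]["message"]["content"])
```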

@abetlen closed this as completed on Nov 8, 2023