
Running LLaMA v2 with chat format. #507


Closed
viniciusarruda opened this issue Jul 19, 2023 · 16 comments
Labels
bug Something isn't working

Comments

@viniciusarruda
Contributor

viniciusarruda commented Jul 19, 2023

Prerequisites

  • I am running the latest code. Development is very rapid so there are no tagged versions as of now.
  • I carefully followed the README.md.
  • I searched using keywords relevant to my issue to make sure that I am creating a new issue that is not already open (or closed).
  • I reviewed the Discussions, and have a new bug or useful enhancement to share.

Expected Behavior

Generate all tokens without any error.

Current Behavior

Note: I have omitted my file path.

llama.cpp: loading model from models\llama-2-7b-chat.ggmlv3.q2_K.bin
llama_model_load_internal: format     = ggjt v3 (latest)
llama_model_load_internal: n_vocab    = 32000
llama_model_load_internal: n_ctx      = 512
llama_model_load_internal: n_embd     = 4096
llama_model_load_internal: n_mult     = 256
llama_model_load_internal: n_head     = 32
llama_model_load_internal: n_layer    = 32
llama_model_load_internal: n_rot      = 128
llama_model_load_internal: freq_base  = 10000.0
llama_model_load_internal: freq_scale = 1
llama_model_load_internal: ftype      = 10 (mostly Q2_K)
llama_model_load_internal: n_ff       = 11008
llama_model_load_internal: model size = 7B
llama_model_load_internal: ggml ctx size =    0.08 MB
llama_model_load_internal: mem required  = 4303.65 MB (+ 1026.00 MB per state)
llama_new_context_with_model: kv self size  =  256.00 MB
AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | VSX = 0 |
  Title: The Beauty and Wonder of Hawaii

Introduction:
Hawaii, the most remote island chain in the world, is a place like no other. Located in the Pacific Ocean, it's a group of eight major islands that boast of stunning natural beauty, rich cultural heritage, and a unique history. From the rugged mountains to the sun-kissed beaches, Hawaii is a true paradise on earth.

Page 1: Overview of the Islands
Hawaii consists of eight main islands - Niihau, Kahoolawe, Lanai, Maui, Oahu, Molokai, and the Big Island. Each island has its unique charm and attractions. For instance, the Big Island is home to two active volcanoes - Mauna Kea and Haleakala, while Maui is famous for its stunning beaches and the Road to Hana. Oahu, the most populous island, houses the capital city Honolulu and is known for its vibrant culture.

Page 2: Natural Wonders
Hawaii's natural beauty is among the most impressive in the world. The Big Island boasts of two active volcanoes - Mauna Kea and Haleakala, while Maui is home to stunning beaches like Makena and Waileaks. Oahu's Manoa Falls is a popular spot for nature lovers, while Lanai's Polihua Beach offers a serene retreat from the hustle-bustle of life. The island chain
Traceback (most recent call last):
  File "\test_chat_format.py", line 91, in <module>
    for token in completion:
  File "venv\lib\site-packages\llama_cpp\llama.py", line 713, in generate
    self.eval(tokens)
  File "venv\lib\site-packages\llama_cpp\llama.py", line 470, in eval
    self.scores[self.n_tokens + offset : self.n_tokens + n_tokens, :].reshape(
ValueError: could not broadcast input array from shape (32000,) into shape (0,)

Environment and Context

Running on Windows (PowerShell).
Python version: 3.10.9

Failure Information

When the error happens, the state of the variables at line 470 of llama.py is as shown in the following image:

[image: debugger view of the variables at llama.py line 470]

which results in a zero-range index for the assignment.
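
The failure can be reproduced in isolation with a short sketch (illustrative only; the names mimic the preallocated scores buffer that Llama.eval writes into, and the exact internals may differ):

```python
import numpy as np

# Illustrative sketch: write a logits row into a preallocated (n_ctx, n_vocab)
# scores buffer after the context window is already full.
n_ctx, n_vocab = 512, 32000
scores = np.zeros((n_ctx, n_vocab), dtype=np.single)

n_tokens = 512  # context already full
logits = np.zeros(n_vocab, dtype=np.single)

# The row slice past the end of the buffer is empty, so the assignment fails:
scores[n_tokens : n_tokens + 1, :].reshape(-1)[:] = logits
# ValueError: could not broadcast input array from shape (32000,) into shape (0,)
```

In other words, the prompt plus the generated tokens have exceeded n_ctx (512 here), so there is no room left in the buffer for the new logits.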

Steps to Reproduce

1. Download the llama-2-7b-chat.ggmlv3.q2_K.bin model here
2. Run the following code:

Note: The code is essentially this. I'm trying to use the correct chat format, based on some of the discussions around it: [1, 2, 3]

from llama_cpp import Llama
import os
from typing_extensions import TypedDict, Literal
from typing import List, Optional

Role = Literal["system", "user", "assistant"]


class Message(TypedDict):
    role: Role
    content: str


B_INST, E_INST = "[INST]", "[/INST]"
B_SYS, E_SYS = "<<SYS>>\n", "\n<</SYS>>\n\n"
DEFAULT_SYSTEM_PROMPT = """\
You are a helpful, respectful and honest assistant. Always answer as helpfully as possible, while being safe.  Your answers should not include any harmful, unethical, racist, sexist, toxic, dangerous, or illegal content. Please ensure that your responses are socially unbiased and positive in nature.

If a question does not make any sense, or is not factually coherent, explain why instead of answering something not correct. If you don't know the answer to a question, please don't share false information."""


def make_prompt_llama2(llm, messages: List[Message]) -> List[int]:
    # Mirrors Meta's reference chat format: the system prompt is folded into the
    # first user turn, and each completed user/assistant exchange ends with EOS.
    if messages[0]["role"] != "system":
        messages = [
            {
                "role": "system",
                "content": DEFAULT_SYSTEM_PROMPT,
            }
        ] + messages

    messages = [
        {
            "role": messages[1]["role"],
            "content": B_SYS + messages[0]["content"] + E_SYS + messages[1]["content"],
        }
    ] + messages[2:]

    assert all([msg["role"] == "user" for msg in messages[::2]]) and all(
        [msg["role"] == "assistant" for msg in messages[1::2]]
    ), (
        "model only supports 'system', 'user' and 'assistant' roles, "
        "starting with 'system', then 'user' and alternating (u/a/u/a/u...)"
    )

    # Tokenize each completed (user, assistant) pair with a leading BOS, append EOS,
    # and concatenate all turns into a single token list.
    dialog_tokens = sum(
        [
            llm.tokenize(
                bytes(
                    f"{B_INST} {(prompt['content']).strip()} {E_INST} {(answer['content']).strip()} ",
                    "utf-8",
                ),
                add_bos=True,
            )
            + [llm.token_eos()]
            for prompt, answer in zip(
                messages[::2],
                messages[1::2],
            )
        ],
        [],
    )

    assert messages[-1]["role"] == "user", f"Last message must be from user, got {messages[-1]['role']}"

    # The final user message gets no trailing EOS so the model will generate the answer.
    dialog_tokens += llm.tokenize(
        bytes(f"{B_INST} {(messages[-1]['content']).strip()} {E_INST}", "utf-8"),
        add_bos=True,
    )

    return dialog_tokens


if __name__ == "__main__":
    llm = Llama(model_path=os.path.join("models", "llama-2-7b-chat.ggmlv3.q2_K.bin"))

    messages: List[Message] = [
        Message(role="user", content="How are you?"),
        Message(role="assistant", content="I'm fine!"),
        Message(role="user", content="Write a four page long essay about Hawaii."),
    ]

    tokens = make_prompt_llama2(
        llm,
        messages,
    )

    completion = llm.generate(
        tokens=tokens,
    )

    for token in completion:
        if token == llm.token_eos():
            break
        print(llm.detokenize([token]).decode("utf-8"), end="", flush=True)
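
For reference, the token sequence built by make_prompt_llama2 for the three example messages corresponds roughly to the following layout, where <BOS> and <EOS> stand for the special tokens added by tokenize(add_bos=True) and llm.token_eos() (a sketch, not literal output):

```
<BOS>[INST] <<SYS>>
{DEFAULT_SYSTEM_PROMPT}
<</SYS>>

How are you? [/INST] I'm fine! <EOS><BOS>[INST] Write a four page long essay about Hawaii. [/INST]
```
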
@viniciusarruda
Contributor Author

For those interested, I've created this repo to handle LLaMA v2 chat completion. However, I still need to solve this issue.

@abetlen
Owner

abetlen commented Jul 20, 2023

@viniciusarruda thank you for reporting this issue and setting up that repo. I'm working on making the chat completion formatting configurable and will add that as an option.

@viniciusarruda
Contributor Author

viniciusarruda commented Jul 20, 2023

Nice! I'm still trying to solve the issue presented here. I'm getting a different token when comparing the original Meta tokenizer and GGML. I think it is not related to this repo, since I'm calling the low-level API and the result is the same.
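
A minimal way to compare the two tokenizers looks roughly like this (a sketch: it assumes sentencepiece is installed and that Meta's original tokenizer.model file is available locally; the paths are illustrative):

```python
from llama_cpp import Llama
from sentencepiece import SentencePieceProcessor

text = "[INST] How are you? [/INST]"

# Meta's original SentencePiece tokenizer
sp = SentencePieceProcessor(model_file="tokenizer.model")
meta_ids = [sp.bos_id()] + sp.encode(text)

# llama.cpp / GGML tokenizer via llama-cpp-python
llm = Llama(model_path="models/llama-2-7b-chat.ggmlv3.q2_K.bin")
ggml_ids = llm.tokenize(text.encode("utf-8"), add_bos=True)

print("meta:", meta_ids)
print("ggml:", ggml_ids)
```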

@burkaygur

With this formatting, do you recommend the chat version or the regular version of Llama 2?

@BartzLeon

Hey, so I get the same error. Is this maybe related?

from langchain import LlamaCpp
from langchain.document_loaders import DirectoryLoader, TextLoader

local_model_path = '../llama/llama.cpp/models/13B/ggml-model-q4_0.bin'
loader = TextLoader('example.txt')

from langchain.embeddings import LlamaCppEmbeddings
llama_llm = LlamaCpp(model_path=local_model_path, n_ctx=2048, max_tokens=50)

from langchain.indexes import VectorstoreIndexCreator
llama_emb = LlamaCppEmbeddings(model_path=local_model_path)
index = VectorstoreIndexCreator(embedding=llama_emb).from_loaders([loader])

q = "What is the context about?"
print(index.query(q, llm=llama_llm))

When example.txt gets too big, I get:
ValueError: could not broadcast input array from shape (8,) into shape (0,)
Do you know how I could solve this?
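
One thing that may help (a sketch, not verified against this exact setup): split the document into chunks that fit comfortably inside the embedding model's context window before indexing, for example with a standard LangChain text splitter. The chunk sizes below are illustrative:

```python
from langchain.text_splitter import RecursiveCharacterTextSplitter

# Keep each embedded chunk well below the model's context window (n_ctx).
splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50)

index = VectorstoreIndexCreator(
    embedding=llama_emb,
    text_splitter=splitter,
).from_loaders([loader])
```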

@ffernn-dev

> @viniciusarruda thank you for reporting this issue and setting up that repo. I'm working on making the chat completion formatting configurable and will add that as an option.

Ah thank you so much! I appreciate your work a ton, it's made my life so much easier <3

@viniciusarruda
Contributor Author

Since this is related to tokenization, I'll link a llama.cpp issue I found that could be relevant: ggml-org/llama.cpp#2310

@viniciusarruda
Contributor Author

To avoid this error, I need to set max_tokens so the output is truncated. I don't know why, but that is what I found.

So, in order to correctly use everything that is implemented, it is best to use the Llama.__call__ method, right @abetlen?
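
For reference, a manual cap along these lines avoids the error with the low-level generate loop (a sketch; it assumes llm and tokens are built as in the script above, and that llm.n_ctx() returns the context size the model was loaded with):

```python
# Stop before prompt tokens + generated tokens exceed the context window.
budget = llm.n_ctx() - len(tokens)

generated = 0
for token in llm.generate(tokens=tokens):
    if token == llm.token_eos() or generated >= budget:
        break
    print(llm.detokenize([token]).decode("utf-8", errors="ignore"), end="", flush=True)
    generated += 1
```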

@viniciusarruda
Contributor Author

Sorry, I was checking and it seems that is not possible. To include the LLaMA v2 chat completion functionality in this repo, the original llama.cpp tokenize function needs the ability to append the EOS token when formatting the messages. Then, in this repo, the tokenize call needs to handle the chat format by tokenizing each turn and setting EOS to True when needed.

Also, since this format is only valid for LLaMA 2, this repo needs some kind of flag to select the correct chat completion format.

@gjmulder added the bug (Something isn't working) label on Jul 30, 2023
@ffernn-dev

Any news on this? I've managed to get something working with a sliding context window, but I'm still getting this error once it goes out of context range.
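
For anyone trying the same approach, the sliding window can be sketched as dropping the oldest prompt tokens so that the prompt plus a generation budget always fits inside n_ctx (the numbers are illustrative, and this naive version does not preserve the system prompt):

```python
max_new_tokens = 256
window = llm.n_ctx() - max_new_tokens

# Keep only the most recent tokens before calling generate().
if len(tokens) > window:
    tokens = tokens[-window:]
```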

@phamkhactu

@viniciusarruda, @ffernn-dev Could you share some code showing how you fixed it? I have set truncate and max_tokens, but it is not working.

@qdore

qdore commented Aug 29, 2023

I also encountered this problem.

@ffernn-dev

> @viniciusarruda, @ffernn-dev Could you share some code showing how you fixed it? I have set truncate and max_tokens, but it is not working.

Hey, sorry, I'm in the middle of exam season at the moment, so I can't throw together a minimal reproduction, but the code at the top of this issue should do it, I think?

@abetlen
Owner

abetlen commented Aug 29, 2023

@viniciusarruda sorry for the delay; I will start working on this. I needed to first merge in the gguf changes to get an accurate model description, so I should now be able to detect the model format and implement a fix.

@sofianhw

> @viniciusarruda sorry for the delay; I will start working on this. I needed to first merge in the gguf changes to get an accurate model description, so I should now be able to detect the model format and implement a fix.

Thanks for the update! I appreciate the effort you're putting into getting everything set up correctly with the gguf changes. Can you confirm whether the llama-2 chat completion format is still in progress? I'm looking forward to it. Thanks!

@abetlen
Owner

abetlen commented Nov 8, 2023

This should be solved now with the Llama class chat_format parameter which can be used to select from common chat formats (including llama-2) and can be extended by hand using the chat_handler parameter.
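
For example (a minimal usage sketch; the model path is illustrative and assumes a gguf conversion of the chat model):

```python
from llama_cpp import Llama

llm = Llama(
    model_path="models/llama-2-7b-chat.Q4_K_M.gguf",
    chat_format="llama-2",
)

response = llm.create_chat_completion(
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "How are you?"},
    ],
)
print(response["choices"][0]["message"]["content"])
```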

@abetlen closed this as completed on Nov 8, 2023