Add grammar-based sampling #572

Merged
4 commits merged into abetlen:main on Aug 8, 2023

Conversation

c0sogi (Contributor) commented Aug 5, 2023

Recently, grammar-based sampling was merged into llama.cpp. However, there is currently no explicit parser API we can use from Python, so I translated grammar-parser.cpp into llama_grammar.py.

I've tested it using vendor/llama.cpp/grammars/json.gbnf, and the output of the parsed grammar exactly matches the compiled version. See the example below. I hope this will help implement function calling someday!

Test code:

from llama_cpp.llama import Llama


llm = Llama(
    model_path="./models/llama-7B/ggml-model.bin",
    grammar="./vendor/llama.cpp/grammars/json.gbnf",
)
print(llm(prompt="")["choices"][0]["text"])

Output:

ggml_init_cublas: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 4070 Ti, compute capability 8.9
llama.cpp: loading model from ./models/llama-7B/ggml-model.bin
llama_model_load_internal: format     = ggjt v3 (latest)
llama_model_load_internal: n_vocab    = 32000
llama_model_load_internal: n_ctx      = 512
llama_model_load_internal: n_embd     = 3200
llama_model_load_internal: n_mult     = 240
llama_model_load_internal: n_head     = 32
llama_model_load_internal: n_head_kv  = 32
llama_model_load_internal: n_layer    = 26
llama_model_load_internal: n_rot      = 100
llama_model_load_internal: n_gqa      = 1
llama_model_load_internal: rnorm_eps  = 1.0e-06
llama_model_load_internal: n_ff       = 8640
llama_model_load_internal: freq_base  = 10000.0
llama_model_load_internal: freq_scale = 1
llama_model_load_internal: ftype      = 3 (mostly Q4_1)
llama_model_load_internal: model size = 3B
llama_model_load_internal: ggml ctx size =    0.07 MB
llama_model_load_internal: using CUDA for GPU acceleration
llama_model_load_internal: mem required  = 2302.96 MB (+  162.50 MB per state)
llama_model_load_internal: offloading 0 repeating layers to GPU
llama_model_load_internal: offloaded 0/29 layers to GPU
llama_model_load_internal: total VRAM used: 288 MB
llama_new_context_with_model: kv self size  =  162.50 MB
AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | VSX = 0 | 
from_string grammar:
root ::= object
object ::= [{] ws object_11 [}]
value ::= object | array | string | number | boolean | [n] [u] [l] [l]
array ::= [[] ws array_15 []]
string ::= ["] string_18 ["] ws
number ::= number_19 number_20 ws
boolean ::= boolean_21 ws
ws ::= ws_23
object_8 ::= string [:] ws value object_10
object_9 ::= [,] ws string [:] ws value
object_10 ::= object_9 object_10 |
object_11 ::= object_8 |
array_12 ::= value array_14
array_13 ::= [,] ws value
array_14 ::= array_13 array_14 |
array_15 ::= array_12 |
string_16 ::= [^"\] | [\] string_17
string_17 ::= ["\/bfnrt] | [u] [0-9a-fA-F] [0-9a-fA-F] [0-9a-fA-F] [0-9a-fA-F]
string_18 ::= string_16 string_18 |
number_19 ::= [-] |
number_20 ::= [0-9] number_20 | [0-9]
boolean_21 ::= [t] [r] [u] [e] | [f] [a] [l] [s] [e]
ws_22 ::= [ <U+0009><U+000A>] ws
ws_23 ::= ws_22 |


llama_print_timings:        load time =   213.76 ms
llama_print_timings:      sample time =   192.72 ms /    28 runs   (    6.88 ms per token,   145.29 tokens per second)
llama_print_timings: prompt eval time =   213.74 ms /     2 tokens (  106.87 ms per token,     9.36 tokens per second)
llama_print_timings:        eval time =  1346.34 ms /    27 runs   (   49.86 ms per token,    20.05 tokens per second)
llama_print_timings:       total time =  1797.62 ms
{
 "name": "Ralph",
 "age": 50,
 "gender": "male"
}

Once this is merged, I will follow up with a PR for a function-call feature example. It will parse a real Python function into a grammar string. See the test result below:

if __name__ == "__main__":
    from typing import Annotated

    from llama_cpp import LlamaGrammar, Llama
    
    
    # Define a python function and parse it into a grammar
    def get_current_weather(
        location: Annotated[
            str,
            "The location to get the current weather for",
        ],
        unit: Annotated[
            str,
            "The unit of temperature to return",
            ["fahrenheit", "celsius"],
        ],
        source: Annotated[
            str,
            "The source of the weather information",
            ["openweathermap", "weatherapi"],
        ] = "openweathermap",
    ):
        """Get the current weather in a given location"""
    
    model_path = "C:/Users/sdml/Desktop/orca-mini-3b.ggmlv3.q4_0.bin"
    # SchemaConverter (from the planned function-call PR) converts the function
    # signature and its annotations into a GBNF grammar string.
    grammar: str = SchemaConverter.from_function(get_current_weather)
    llama_grammar = LlamaGrammar.from_string(grammar)
    llm = Llama(model_path)
    llm.grammar = llama_grammar
    print(llm(prompt="### User: What is the weather in London today? ### Assistant:")["choices"][0]["text"])
    
    # Output:
    # { "location": "London", "source": "openweathermap","unit" : "celsius"}

c0sogi changed the title from "Added low-level grammar API" to "Add low-level grammar API" on Aug 5, 2023
c0sogi changed the title from "Add low-level grammar API" to "Add grammar-based sampling" on Aug 6, 2023
c0sogi mentioned this pull request on Aug 6, 2023
c0sogi (Contributor, Author) commented Aug 7, 2023

After a further review, I noticed that grammar sampling doesn't work after the first completion. This seems to be intended behavior for non-interactive mode (llama.cpp's interactive example resets the grammar state itself, as in the C++ snippet below), so I added a few lines of code that reset the grammar for every generation, allowing grammar sampling to be applied to every completion.

            if (n_past > 0) {
                if (is_interacting) {
                    // reset grammar state if we're restarting generation
                    if (grammar != NULL) {
                        llama_grammar_free(grammar);

                        std::vector<const llama_grammar_element *> grammar_rules(
                            parsed_grammar.c_rules());
                        grammar = llama_grammar_init(
                            grammar_rules.data(), grammar_rules.size(),
                            parsed_grammar.symbol_ids.at("root"));
                    }
                }
                is_interacting = false;
            }
        }
class LlamaGrammar:
    ...
    def reset(self) -> None:
        llama_cpp.llama_grammar_free(self.grammar)
        self.grammar = llama_cpp.llama_grammar_init(
            self.rules, self.n_rules, self.start_rule_index
        )

class Llama:
    def generate(...):
         ...
        if reset:
            self.reset()

        if self.grammar is not None:
            self.grammar.reset()

        while True:
            self.eval(tokens)
        ...

abetlen (Owner) commented Aug 8, 2023

@c0sogi amazing work. I was planning on pulling in the grammar-based sampling but ran into an issue with the parser. I'll merge this in, but I'll likely move the grammar directly to the generate and completion calls; that way a different grammar can be used with the same Llama model.
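
A minimal sketch of what that per-call usage could look like (the grammar= keyword on the completion call is assumed here, matching how later comments in this thread invoke it, not taken from this PR itself):

from llama_cpp import Llama, LlamaGrammar

llm = Llama(model_path="./models/llama-7B/ggml-model.bin")

# Two different grammars, one model: each call supplies its own constraint.
json_grammar = LlamaGrammar.from_string(open("./vendor/llama.cpp/grammars/json.gbnf").read())
digits_grammar = LlamaGrammar.from_string(r"root ::= [0-9]+")

print(llm("Describe yourself as JSON: ", grammar=json_grammar, max_tokens=64)["choices"][0]["text"])
print(llm("Pick a number: ", grammar=digits_grammar, max_tokens=8)["choices"][0]["text"])

The same Llama instance serves both calls; only the sampling constraint changes between them.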

abetlen merged commit 843b7cc into abetlen:main on Aug 8, 2023
c0sogi (Contributor, Author) commented Aug 9, 2023

@abetlen My mistake. These four enum-related parts

    if rule.empty() or rule.back().type != llama_gretype.LLAMA_GRETYPE_END.value:
        raise RuntimeError(
            "malformed rule, does not end with LLAMA_GRETYPE_END: " + str(rule_id)
        )
...

        if case is llama_gretype.LLAMA_GRETYPE_END.value:
            raise RuntimeError("unexpected end of rule: " + str(rule_id) + "," + str(i))

...

def is_char_element(elem: LlamaGrammarElement) -> bool:
    return elem.type in (
        llama_gretype.LLAMA_GRETYPE_CHAR.value,
        llama_gretype.LLAMA_GRETYPE_CHAR_NOT.value,
        llama_gretype.LLAMA_GRETYPE_CHAR_ALT.value,
        llama_gretype.LLAMA_GRETYPE_CHAR_RNG_UPPER.value,
    )
...

            if rule[i + 1].type in (
                llama_gretype.LLAMA_GRETYPE_CHAR_ALT.value,
                llama_gretype.LLAMA_GRETYPE_CHAR_RNG_UPPER.value,
            ):

should be these:

    if rule.empty() or rule.back().type is not llama_gretype.LLAMA_GRETYPE_END:
        raise RuntimeError(
            "malformed rule, does not end with LLAMA_GRETYPE_END: " + str(rule_id)
        )
...

        if case is llama_gretype.LLAMA_GRETYPE_END:
            raise RuntimeError("unexpected end of rule: " + str(rule_id) + "," + str(i))
...

def is_char_element(elem: LlamaGrammarElement) -> bool:
    return elem.type in (
        llama_gretype.LLAMA_GRETYPE_CHAR,
        llama_gretype.LLAMA_GRETYPE_CHAR_NOT,
        llama_gretype.LLAMA_GRETYPE_CHAR_ALT,
        llama_gretype.LLAMA_GRETYPE_CHAR_RNG_UPPER,
    )
...

            if rule[i + 1].type in (
                llama_gretype.LLAMA_GRETYPE_CHAR_ALT,
                llama_gretype.LLAMA_GRETYPE_CHAR_RNG_UPPER,
            ):

But I don't know why the previous version still worked. Haha 😅
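
For context, a short generic illustration of why mixing enum members and their .value's is fragile (a standalone example, not code from llama_grammar.py):

from enum import Enum, IntEnum

class PlainType(Enum):
    END = 0

class IntType(IntEnum):
    END = 0

print(PlainType.END == 0)                # False: plain Enum members never equal their value
print(IntType.END == 0)                  # True: IntEnum members compare equal to ints
print(IntType.END == IntType.END.value)  # True: equality still holds via the int value
print(IntType.END is IntType.END.value)  # False: 'is' checks object identity, member vs int

Comparing members to members, as in the corrected code above, avoids depending on which of these behaviors the enum happens to have.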

talhalatifkhan commented
I am trying to make sure that my output follows a JSON format every time. I stumbled upon jsonformer, and from there I stumbled upon grammar-based sampling; I used json-schema-to-grammar.py to convert my JSON schema.

I want to know whether grammar-based sampling is meant for this specific purpose, and if so, how do I use it.

JSON schema:

json_schema = {
    "type": "object",
    "properties": {
        "Stage": {
            "type": "string",
            "enum": ["first", "second"]
        },
        "Task Finished": {"type": "boolean"},
        "Statement": {"type": "string"},
        "Assistant": {"type": "string"}
    }
}

Llama grammar

space ::= " "?
string ::=  "\"" (
        [^"\\] |
        "\\" (["\\/bfnrt] | "u" [0-9a-fA-F] [0-9a-fA-F] [0-9a-fA-F] [0-9a-fA-F])
      )* "\"" space 
Stage ::= "\"first\"" | "\"second\""
boolean ::= ("true" | "false") space
root ::= "{" space "\"Assistant\"" space ":" space string "," space "\"Stage\"" space ":" space Stage "," space "\"Statement\"" space ":" space string "," space "\"Task Finished\"" space ":" space boolean "}" space

Here is my code

from llama_cpp import Llama, LlamaGrammar

fs_template = """
You are a precise AI comparer. Your task is to match the user's intent to the statements in the context and confirm if the identified intent is correct.
Your responses should strictly follow the format below:
    Stage: [print 'first']
    User Intent: [insert user intent statement here]
    Task Finished: [insert boolean value based on whether user intent is confirmed]
    Assistant: [insert Assistant response here]


Adhere to the following instructions to complete the task:
1. Start by trying to match the user's question to the statements in the context.
2. If you identify the matching statement to the user's question then confirm it from the user.
3. If the user's intent is unclear or doesn't match the context, ask follow-up questions by providing the options in the context.
4. Once you have confirmed the user intent, set "Task Finished: True" and proceed with your response.
5. You will fail your task if the output generated does not follow the format mentioned above.

Context: (only knowledge base you have)
------------
sample context
-----------
"""

schema = '''
space ::= " "?
string ::=  "\"" (
        [^"\\] |
        "\\" (["\\/bfnrt] | "u" [0-9a-fA-F] [0-9a-fA-F] [0-9a-fA-F] [0-9a-fA-F])
      )* "\"" space 
Stage ::= "\"first\"" | "\"second\""
boolean ::= ("true" | "false") space
root ::= "{" space "\"Assistant\"" space ":" space string "," space "\"Stage\"" space ":" space Stage "," space "\"Statement\"" space ":" space string "," space "\"Task Finished\"" space ":" space boolean "}" space
'''


def get_prompt(question: str, chat_history: list,
               system_prompt: str) -> str:
    texts = [f'[INST] <<SYS>>\n{system_prompt}\n<</SYS>>\n\n']
    for user_input, response in chat_history:
        texts.append(f'{user_input.strip()} [/INST] {response.strip()} </s><s> [INST] ')
    texts.append(f'{question.strip()} [/INST]')
    return ''.join(texts)


history = []
prompt = get_prompt("user query", history, fs_template)

grammar = LlamaGrammar.from_string(grammar=schema, verbose=True)
print(grammar)
client = Llama(
    model_path="model/llama-2-13b-chat.ggmlv3.q8_0.bin",
    n_ctx=4098,
    n_threads=16,
    last_n_tokens_size=70,
)

answer = client(
    prompt,
    grammar=grammar,
    stream=False,
    temperature=0.0,
    top_p=0.95,
    top_k=50,
    repeat_penalty=1.3,
    max_tokens=4000,
)
print(answer)

This is the error I am getting:

parse: error parsing grammar: expecting newline or end at \] |
        "\" (["\/bfnrt] | "u" [0-9a-fA-F] [0-9a-fA-F] [0-9a-fA-F] [0-9a-fA-F])
      )* """ space 
Stage ::= ""first"" | ""second""
boolean ::= ("true" | "false") space
root ::= "{" space ""Assistant"" space ":" space string "," space ""Stage"" space ":" space Stage "," space ""Statement"" space ":" space string "," space ""Task Finished"" space ":" space boolean "}" space

Traceback (most recent call last):
  File "/home/talha/CloudWhisper/jformer.py", line 49, in <module>
    grammar = LlamaGrammar.from_string(grammar=schema,verbose=True)
  File "/home/talha/.local/lib/python3.10/site-packages/llama_cpp/llama_grammar.py", line 66, in from_string
    raise ValueError(
ValueError: from_string: error parsing grammar file: parsed_grammar.rules is empty

c0sogi (Contributor, Author) commented Aug 16, 2023

@talhalatifkhan

I think your schema is correct; the failure is probably caused by the typos in llama_grammar.py. Can you try again with the code I mentioned above?

talhalatifkhan commented
@c0sogi I tested it out and I am getting the same result:

parse: error parsing grammar: expecting newline or end at \] |
        "\" (["\/bfnrt] | "u" [0-9a-fA-F] [0-9a-fA-F] [0-9a-fA-F] [0-9a-fA-F])
      )* """ space 
Stage ::= ""first"" | ""second""
boolean ::= ("true" | "false") space
root ::= "{" space ""Assistant"" space ":" space string "," space ""Stage"" space ":" space Stage "," space ""Statement"" space ":" space string "," space ""Task Finished"" space ":" space boolean "}" space

Traceback (most recent call last):
  File "/home/talha/CloudWhisper/jformer.py", line 52, in <module>
    grammar = LlamaGrammar.from_string(grammar=schema, verbose=True)
  File "/home/talha/.local/lib/python3.10/site-packages/llama_cpp/llama_grammar.py", line 66, in from_string
    raise ValueError(
ValueError: from_string: error parsing grammar file: parsed_grammar.rules is empty

c0sogi (Contributor, Author) commented Aug 17, 2023

> @c0sogi I tested it out and I am getting the same result (same parse error and traceback as above)

#615
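
A likely cause of the parse error above (an assumption, not something confirmed in this thread): the schema is written as a regular triple-quoted Python string, so Python collapses \" to " and \\ to \ before LlamaGrammar.from_string ever sees the text, which matches the doubled quotes and stripped backslashes printed by the parser. Declaring the grammar as a raw string preserves the escapes:

from llama_cpp import LlamaGrammar

# Raw string: Python leaves \" and \\ untouched, so the GBNF escapes survive.
schema = r'''
space ::= " "?
string ::= "\"" (
        [^"\\] |
        "\\" (["\\/bfnrt] | "u" [0-9a-fA-F] [0-9a-fA-F] [0-9a-fA-F] [0-9a-fA-F])
      )* "\"" space
Stage ::= "\"first\"" | "\"second\""
boolean ::= ("true" | "false") space
root ::= "{" space "\"Assistant\"" space ":" space string "," space "\"Stage\"" space ":" space Stage "," space "\"Statement\"" space ":" space string "," space "\"Task Finished\"" space ":" space boolean "}" space
'''

grammar = LlamaGrammar.from_string(grammar=schema, verbose=True)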
