Add grammar-based sampling #572

Merged
4 commits merged into abetlen:main on Aug 8, 2023

Conversation

c0sogi (Contributor) commented Aug 5, 2023

Recently, grammar-based sampling was merged into llama.cpp. However, there is currently no explicit parser API we can use from Python, so I translated grammar-parser.cpp into llama_grammar.py.

I've tested it using vendor/llama.cpp/grammars/json.gbnf, and the output of the parsed grammar exactly matches the compiled version. See the example below. I hope this will help implement function calling someday!

Test code:

from llama_cpp.llama import Llama


llm = Llama(
    model_path="./models/llama-7B/ggml-model.bin",
    grammar="./vendor/llama.cpp/grammars/json.gbnf",
)
print(llm(prompt="")["choices"][0]["text"])

Output:

ggml_init_cublas: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 4070 Ti, compute capability 8.9
llama.cpp: loading model from ./models/llama-7B/ggml-model.bin
llama_model_load_internal: format     = ggjt v3 (latest)
llama_model_load_internal: n_vocab    = 32000
llama_model_load_internal: n_ctx      = 512
llama_model_load_internal: n_embd     = 3200
llama_model_load_internal: n_mult     = 240
llama_model_load_internal: n_head     = 32
llama_model_load_internal: n_head_kv  = 32
llama_model_load_internal: n_layer    = 26
llama_model_load_internal: n_rot      = 100
llama_model_load_internal: n_gqa      = 1
llama_model_load_internal: rnorm_eps  = 1.0e-06
llama_model_load_internal: n_ff       = 8640
llama_model_load_internal: freq_base  = 10000.0
llama_model_load_internal: freq_scale = 1
llama_model_load_internal: ftype      = 3 (mostly Q4_1)
llama_model_load_internal: model size = 3B
llama_model_load_internal: ggml ctx size =    0.07 MB
llama_model_load_internal: using CUDA for GPU acceleration
llama_model_load_internal: mem required  = 2302.96 MB (+  162.50 MB per state)
llama_model_load_internal: offloading 0 repeating layers to GPU
llama_model_load_internal: offloaded 0/29 layers to GPU
llama_model_load_internal: total VRAM used: 288 MB
llama_new_context_with_model: kv self size  =  162.50 MB
AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | VSX = 0 | 
from_string grammar:
root ::= object
object ::= [{] ws object_11 [}]
value ::= object | array | string | number | boolean | [n] [u] [l] [l]
array ::= [[] ws array_15 []]
string ::= ["] string_18 ["] ws
number ::= number_19 number_20 ws
boolean ::= boolean_21 ws
ws ::= ws_23
object_8 ::= string [:] ws value object_10
object_9 ::= [,] ws string [:] ws value
object_10 ::= object_9 object_10 |
object_11 ::= object_8 |
array_12 ::= value array_14
array_13 ::= [,] ws value
array_14 ::= array_13 array_14 |
array_15 ::= array_12 |
string_16 ::= [^"\] | [\] string_17
string_17 ::= ["\/bfnrt] | [u] [0-9a-fA-F] [0-9a-fA-F] [0-9a-fA-F] [0-9a-fA-F]
string_18 ::= string_16 string_18 |
number_19 ::= [-] |
number_20 ::= [0-9] number_20 | [0-9]
boolean_21 ::= [t] [r] [u] [e] | [f] [a] [l] [s] [e]
ws_22 ::= [ <U+0009><U+000A>] ws
ws_23 ::= ws_22 |


llama_print_timings:        load time =   213.76 ms
llama_print_timings:      sample time =   192.72 ms /    28 runs   (    6.88 ms per token,   145.29 tokens per second)
llama_print_timings: prompt eval time =   213.74 ms /     2 tokens (  106.87 ms per token,     9.36 tokens per second)
llama_print_timings:        eval time =  1346.34 ms /    27 runs   (   49.86 ms per token,    20.05 tokens per second)
llama_print_timings:       total time =  1797.62 ms
{
 "name": "Ralph",
 "age": 50,
 "gender": "male"
}

Once this is merged, I will follow up with a PR for a function-call feature example. It will parse a real Python function into a grammar string. See the test result below:

if __name__ == "__main__":
    from typing import Annotated

    from llama_cpp import LlamaGrammar, Llama
    
    
    # Define a python function and parse it into a grammar
    def get_current_weather(
        location: Annotated[
            str,
            "The location to get the current weather for",
        ],
        unit: Annotated[
            str,
            "The unit of temperature to return",
            ["fahrenheit", "celsius"],
        ],
        source: Annotated[
            str,
            "The source of the weather information",
            ["openweathermap", "weatherapi"],
        ] = "openweathermap",
    ):
        """Get the current weather in a given location"""
    
    model_path = "C:/Users/sdml/Desktop/orca-mini-3b.ggmlv3.q4_0.bin"
    # SchemaConverter (from the planned function-call PR) converts the function
    # signature and its annotations into a GBNF grammar string.
    grammar: str = SchemaConverter.from_function(get_current_weather)
    llama_grammar = LlamaGrammar.from_string(grammar)
    llm = Llama(model_path)
    llm.grammar = llama_grammar
    print(llm(prompt="### User: What is the weather in London today? ### Assistant:")["choices"][0]["text"])
    
    # Output:
    # { "location": "London", "source": "openweathermap","unit" : "celsius"}

c0sogi changed the title from "Added low-level grammar API" to "Add low-level grammar API" on Aug 5, 2023
c0sogi changed the title from "Add low-level grammar API" to "Add grammar-based sampling" on Aug 6, 2023
c0sogi mentioned this pull request on Aug 6, 2023
c0sogi (Contributor, Author) commented Aug 7, 2023

After a further review, I noticed that grammar sampling doesn't work after the first completion. This seems to be intended behavior for non-interactive mode (llama.cpp's interactive example resets the grammar state itself, as in the C++ snippet below), so I added a few lines of code that reset the grammar for every generation, allowing grammar sampling to be applied to every completion.

            if (n_past > 0) {
                if (is_interacting) {
                    // reset grammar state if we're restarting generation
                    if (grammar != NULL) {
                        llama_grammar_free(grammar);

                        std::vector<const llama_grammar_element *> grammar_rules(
                            parsed_grammar.c_rules());
                        grammar = llama_grammar_init(
                            grammar_rules.data(), grammar_rules.size(),
                            parsed_grammar.symbol_ids.at("root"));
                    }
                }
                is_interacting = false;
            }
        }
class LlamaGrammar:
    ...
    def reset(self) -> None:
        llama_cpp.llama_grammar_free(self.grammar)
        self.grammar = llama_cpp.llama_grammar_init(
            self.rules, self.n_rules, self.start_rule_index
        )

class Llama:
    def generate(...):
         ...
        if reset:
            self.reset()

        if self.grammar is not None:
            self.grammar.reset()

        while True:
            self.eval(tokens)
        ...

abetlen (Owner) commented Aug 8, 2023

@c0sogi amazing work. I was planning on pulling in the grammar-based sampling but ran into an issue with the parser. I'll merge this in, but I'll likely move the grammar directly to the generate and completion calls; that way a different grammar can be used with the same Llama model.
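
A minimal sketch of what that per-call usage could look like (the grammar= keyword on the completion call is assumed here, matching how later comments in this thread invoke it, not taken from this PR itself):

from llama_cpp import Llama, LlamaGrammar

llm = Llama(model_path="./models/llama-7B/ggml-model.bin")

# Two different grammars, one model: each call supplies its own constraint.
json_grammar = LlamaGrammar.from_string(open("./vendor/llama.cpp/grammars/json.gbnf").read())
digits_grammar = LlamaGrammar.from_string(r"root ::= [0-9]+")

print(llm("Describe yourself as JSON: ", grammar=json_grammar, max_tokens=64)["choices"][0]["text"])
print(llm("Pick a number: ", grammar=digits_grammar, max_tokens=8)["choices"][0]["text"])

The same Llama instance serves both calls; only the sampling constraint changes between them.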

abetlen merged commit 843b7cc into abetlen:main on Aug 8, 2023
c0sogi (Contributor, Author) commented Aug 9, 2023

@abetlen My mistake. These four enum-related parts

    if rule.empty() or rule.back().type != llama_gretype.LLAMA_GRETYPE_END.value:
        raise RuntimeError(
            "malformed rule, does not end with LLAMA_GRETYPE_END: " + str(rule_id)
        )
...

        if case is llama_gretype.LLAMA_GRETYPE_END.value:
            raise RuntimeError("unexpected end of rule: " + str(rule_id) + "," + str(i))

...

def is_char_element(elem: LlamaGrammarElement) -> bool:
    return elem.type in (
        llama_gretype.LLAMA_GRETYPE_CHAR.value,
        llama_gretype.LLAMA_GRETYPE_CHAR_NOT.value,
        llama_gretype.LLAMA_GRETYPE_CHAR_ALT.value,
        llama_gretype.LLAMA_GRETYPE_CHAR_RNG_UPPER.value,
    )
...

            if rule[i + 1].type in (
                llama_gretype.LLAMA_GRETYPE_CHAR_ALT.value,
                llama_gretype.LLAMA_GRETYPE_CHAR_RNG_UPPER.value,
            ):

should be these:

    if rule.empty() or rule.back().type is not llama_gretype.LLAMA_GRETYPE_END:
        raise RuntimeError(
            "malformed rule, does not end with LLAMA_GRETYPE_END: " + str(rule_id)
        )
...

        if case is llama_gretype.LLAMA_GRETYPE_END:
            raise RuntimeError("unexpected end of rule: " + str(rule_id) + "," + str(i))
...

def is_char_element(elem: LlamaGrammarElement) -> bool:
    return elem.type in (
        llama_gretype.LLAMA_GRETYPE_CHAR,
        llama_gretype.LLAMA_GRETYPE_CHAR_NOT,
        llama_gretype.LLAMA_GRETYPE_CHAR_ALT,
        llama_gretype.LLAMA_GRETYPE_CHAR_RNG_UPPER,
    )
...

            if rule[i + 1].type in (
                llama_gretype.LLAMA_GRETYPE_CHAR_ALT,
                llama_gretype.LLAMA_GRETYPE_CHAR_RNG_UPPER,
            ):

But I don't know why the previous version still worked. Haha 😅
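
For context, a short generic illustration of why mixing enum members and their .value's is fragile (a standalone example, not code from llama_grammar.py):

from enum import Enum, IntEnum

class PlainType(Enum):
    END = 0

class IntType(IntEnum):
    END = 0

print(PlainType.END == 0)                # False: plain Enum members never equal their value
print(IntType.END == 0)                  # True: IntEnum members compare equal to ints
print(IntType.END == IntType.END.value)  # True: equality still holds via the int value
print(IntType.END is IntType.END.value)  # False: 'is' checks object identity, member vs int

Comparing members to members, as in the corrected code above, avoids depending on which of these behaviors the enum happens to have.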

talhalatifkhan commented
I am trying to make sure that my output follows a JSON format every time. I stumbled upon jsonformer, and from there I stumbled upon grammar-based sampling; I used json-schema-to-grammar.py to convert my JSON schema.

I want to know whether grammar-based sampling is meant for this specific purpose, and if so, how do I use it.

JSON schema:

json_schema = {
    "type": "object",
    "properties": {
        "Stage": {
            "type": "string",
            "enum": ["first", "second"]
        },
        "Task Finished": {"type": "boolean"},
        "Statement": {"type": "string"},
        "Assistant": {"type": "string"}
    }
}

Llama grammar

space ::= " "?
string ::=  "\"" (
        [^"\\] |
        "\\" (["\\/bfnrt] | "u" [0-9a-fA-F] [0-9a-fA-F] [0-9a-fA-F] [0-9a-fA-F])
      )* "\"" space 
Stage ::= "\"first\"" | "\"second\""
boolean ::= ("true" | "false") space
root ::= "{" space "\"Assistant\"" space ":" space string "," space "\"Stage\"" space ":" space Stage "," space "\"Statement\"" space ":" space string "," space "\"Task Finished\"" space ":" space boolean "}" space

Here is my code

from llama_cpp import Llama, LlamaGrammar

fs_template = """
You are a precise AI comparer. Your task is to match the user's intent to the statements in the context and confirm if the identified intent is correct.
Your responses should strictly follow the format below:
    Stage: [print 'first']
    User Intent: [insert user intent statement here]
    Task Finished: [insert boolean value based on whether user intent is confirmed]
    Assistant: [insert Assistant response here]


Adhere to the following instructions to complete the task:
1. Start by trying to match the user's question to the statements in the context.
2. If you identify the matching statement to the user's question then confirm it from the user.
3. If the user's intent is unclear or doesn't match the context, ask follow-up questions by providing the options in the context.
4. Once you have confirmed the user intent, set "Task Finished: True" and proceed with your response.
5. You will fail your task if the output generated does not follow the format mentioned above.

Context: (only knowledge base you have)
------------
sample context
-----------
"""

schema = '''
space ::= " "?
string ::=  "\"" (
        [^"\\] |
        "\\" (["\\/bfnrt] | "u" [0-9a-fA-F] [0-9a-fA-F] [0-9a-fA-F] [0-9a-fA-F])
      )* "\"" space 
Stage ::= "\"first\"" | "\"second\""
boolean ::= ("true" | "false") space
root ::= "{" space "\"Assistant\"" space ":" space string "," space "\"Stage\"" space ":" space Stage "," space "\"Statement\"" space ":" space string "," space "\"Task Finished\"" space ":" space boolean "}" space
'''


def get_prompt(question: str, chat_history: list,
               system_prompt: str) -> str:
    texts = [f'[INST] <<SYS>>\n{system_prompt}\n<</SYS>>\n\n']
    for user_input, response in chat_history:
        texts.append(f'{user_input.strip()} [/INST] {response.strip()} </s><s> [INST] ')
    texts.append(f'{question.strip()} [/INST]')
    return ''.join(texts)


history = []
prompt = get_prompt("user query", history, fs_template)

grammar = LlamaGrammar.from_string(grammar=schema, verbose=True)
print(grammar)
client = Llama(
    model_path="model/llama-2-13b-chat.ggmlv3.q8_0.bin",
    n_ctx=4098,
    n_threads=16,
    last_n_tokens_size=70,
)

answer = client(
    prompt,
    grammar=grammar,
    stream=False,
    temperature=0.0,
    top_p=0.95,
    top_k=50,
    repeat_penalty=1.3,
    max_tokens=4000,
)
print(answer)

This is the error I am getting:

parse: error parsing grammar: expecting newline or end at \] |
        "\" (["\/bfnrt] | "u" [0-9a-fA-F] [0-9a-fA-F] [0-9a-fA-F] [0-9a-fA-F])
      )* """ space 
Stage ::= ""first"" | ""second""
boolean ::= ("true" | "false") space
root ::= "{" space ""Assistant"" space ":" space string "," space ""Stage"" space ":" space Stage "," space ""Statement"" space ":" space string "," space ""Task Finished"" space ":" space boolean "}" space

Traceback (most recent call last):
  File "/home/talha/CloudWhisper/jformer.py", line 49, in <module>
    grammar = LlamaGrammar.from_string(grammar=schema,verbose=True)
  File "/home/talha/.local/lib/python3.10/site-packages/llama_cpp/llama_grammar.py", line 66, in from_string
    raise ValueError(
ValueError: from_string: error parsing grammar file: parsed_grammar.rules is empty

c0sogi (Contributor, Author) commented Aug 16, 2023

@talhalatifkhan

I think your schema is correct; the failure is probably caused by the typos in llama_grammar.py. Can you try again with the code I mentioned above?

talhalatifkhan commented
@c0sogi I tested it out and I am getting the same result:

parse: error parsing grammar: expecting newline or end at \] |
        "\" (["\/bfnrt] | "u" [0-9a-fA-F] [0-9a-fA-F] [0-9a-fA-F] [0-9a-fA-F])
      )* """ space 
Stage ::= ""first"" | ""second""
boolean ::= ("true" | "false") space
root ::= "{" space ""Assistant"" space ":" space string "," space ""Stage"" space ":" space Stage "," space ""Statement"" space ":" space string "," space ""Task Finished"" space ":" space boolean "}" space

Traceback (most recent call last):
  File "/home/talha/CloudWhisper/jformer.py", line 52, in <module>
    grammar = LlamaGrammar.from_string(grammar=schema, verbose=True)
  File "/home/talha/.local/lib/python3.10/site-packages/llama_cpp/llama_grammar.py", line 66, in from_string
    raise ValueError(
ValueError: from_string: error parsing grammar file: parsed_grammar.rules is empty

c0sogi (Contributor, Author) commented Aug 17, 2023

> @c0sogi I tested it out and I am getting the same result (same parse error and traceback as above)

#615
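
A likely cause of the parse error above (an assumption, not something confirmed in this thread): the schema is written as a regular triple-quoted Python string, so Python collapses \" to " and \\ to \ before LlamaGrammar.from_string ever sees the text, which matches the doubled quotes and stripped backslashes printed by the parser. Declaring the grammar as a raw string preserves the escapes:

from llama_cpp import LlamaGrammar

# Raw string: Python leaves \" and \\ untouched, so the GBNF escapes survive.
schema = r'''
space ::= " "?
string ::= "\"" (
        [^"\\] |
        "\\" (["\\/bfnrt] | "u" [0-9a-fA-F] [0-9a-fA-F] [0-9a-fA-F] [0-9a-fA-F])
      )* "\"" space
Stage ::= "\"first\"" | "\"second\""
boolean ::= ("true" | "false") space
root ::= "{" space "\"Assistant\"" space ":" space string "," space "\"Stage\"" space ":" space Stage "," space "\"Statement\"" space ":" space string "," space "\"Task Finished\"" space ":" space boolean "}" space
'''

grammar = LlamaGrammar.from_string(grammar=schema, verbose=True)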
