LLama3 streaming repeats the previous request's first token. #287

Closed
mikutsky opened this issue Apr 19, 2024 · 8 comments · Fixed by #288

mikutsky commented Apr 19, 2024

Hi! I'm running into a problem where the previous request's first token is repeated in subsequent requests when using a stream. The prompt structure follows the Meta Llama 3 documentation. Could you explain why this is happening?

The output of a simple chat example looks like this:

The model name is meta/meta-llama-3-70b-instruct

You: Hi!
Assistant: Hi! How can I help you today?

You: Recommend me a Hemingway novel, please.
Assistant: Hi
I'd recommend "The Old Man and the Sea". It's a classic, concise, and powerful novel that showcases Hemingway's unique writing style.

You: I read it, please recommend something else.
Assistant: Hi
I
How about "A Farewell to Arms"? It's a romantic and tragic novel set during WWI, and it's considered one of Hemingway's best works.

You: It's great! Thank you! Bye!
Assistant: Hi
I
How
You're welcome! I'm glad you enjoyed the recommendation. Have a great day and happy reading! Bye!

Example code:

import os
from replicate.client import Client

replicate_api_key = os.getenv("REPLICATE_API_TOKEN", 'EMPTY')
replicate_model = os.getenv('REPLICATE_MODEL', 'meta/meta-llama-3-70b-instruct')
replicate_client = Client(api_token=replicate_api_key)

SYSTEM_PROMPT = 'You are a helpful assistant. Answer briefly!'
MESSAGES = []


def gen_llama3_prompt(sys_prompt=None, messages=None):
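    # Build a Llama 3 chat prompt following Meta's prompt template.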
    sys_prompt = '' if sys_prompt is None else sys_prompt
    messages = [] if messages is None else messages
    _result = f"<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\n{sys_prompt}<|eot_id|>"
    for m in messages:
        if m['role'] == 'user':
            _result += f'<|start_header_id|>user<|end_header_id|>\n\n{m["content"]}<|eot_id|>'
        elif m['role'] == 'assistant':
            _result += f'<|start_header_id|>assistant<|end_header_id|>\n\n{m["content"]}<|eot_id|>'
    _result += '<|start_header_id|>assistant<|end_header_id|>\n\n'
    return _result


def print_answer(query=''):
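    # Stream the assistant's reply token by token, print it, and record both turns in MESSAGES.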
    message = {'role': 'user', 'content': query}
    answer = ''
    MESSAGES.append(message)
    for event in replicate_client.stream(
            "meta/meta-llama-3-70b-instruct",
            input={
                "top_p": 1e-5,
                "prompt": gen_llama3_prompt(SYSTEM_PROMPT, MESSAGES),
                "max_tokens": 512,
                "min_tokens": 0,
                "temperature": 1e-6
            }):
        token = str(event)
        answer += token
        print(token, end='')
    message = {'role': 'assistant', 'content': answer}
    MESSAGES.append(message)


if __name__ == '__main__':
    print(f'Model name is {replicate_model}')
    while True:
        q = input('\nYou: ')
        print('Assistant: ', end='')
        print_answer(q)
        if 'bye' in q.lower():
            break

Thanks for your help!

mattt (Contributor) commented Apr 19, 2024

Hi @mikutsky. Thanks for reporting this. Can you share any predictions for these? (Go to your replicate.com Dashboard, look under Predictions). Seeing that would help us tell if the problem is in the model or the client library.

Gusakovskyi commented Apr 19, 2024

Hi, I have the same issue.

mattt (Contributor) commented Apr 19, 2024

@Gusakovskyi @mikutsky We've confirmed that there's an issue with stop sequences for meta/meta-llama-3-70b-instruct, and we're working on a fix.

mikutsky (Author) commented:

> Hi @mikutsky. Thanks for reporting this. Can you share any predictions for these? (Go to your replicate.com Dashboard, look under Predictions). Seeing that would help us tell if the problem is in the model or the client library.

It looks like a client-library problem. I'm providing the info for the second query, because the later queries accumulate the mistakes in the prompt.

Everything looks correct on the dashboard:
[Screenshot: the prediction shown in the Replicate dashboard]

Here is the prompt for the second query, and the prompt is still correct:

<|begin_of_text|><|start_header_id|>system<|end_header_id|>

You are a helpful assistant. Answer briefly!<|eot_id|><|start_header_id|>user<|end_header_id|>

Hi!<|eot_id|><|start_header_id|>assistant<|end_header_id|>

Hi! How can I help you today?<|eot_id|><|start_header_id|>user<|end_header_id|>

I read it, please recommend something else.<|eot_id|><|start_header_id|>assistant<|end_header_id|>

However, the console output contains an extra 'Hi' token:

You: Hi!
Assistant: Hi! How can I help you today?

You: I read it, please recommend something else.
Assistant: Hi
I'd be happy to! However, I need a bit more information. What type of content are you in the mood for? A book, article, podcast, or something else?

mattt (Contributor) commented Apr 19, 2024

@mikutsky We just pushed a new build of the model, which should address the stop sequence problem. Please give your client code another try and let me know if that's working for you now.

If not, could you please try calling replicate.stream in isolation? I'd like to rule out the use of input and mutable-state access in a loop, even though that should run synchronously and not be a problem.
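
Something like this would do it (a minimal sketch; the exact input values are illustrative):

```python
import replicate

# Stream the same fixed prompt twice with no surrounding loop state;
# if the second run still prepends a stray token, the client library
# is at fault rather than the calling code.
for _ in range(2):
    for event in replicate.stream(
        "meta/meta-llama-3-70b-instruct",
        input={"prompt": "Hi!", "max_tokens": 64},
    ):
        print(str(event), end="")
    print("\n---")
```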

mattt (Contributor) commented Apr 19, 2024

Actually, I'm able to reproduce this in isolation, so it does appear to be an issue with the client. Working on a fix now.

mattt added a commit that referenced this issue Apr 19, 2024
Fixes #287 

Given the following code:

```python
import replicate


def go():
    for event in replicate.stream(
        "meta/meta-llama-3-8b-instruct",
        input={
            "top_p": 0.9,
            "prompt": "Hi! Help me please:)",
            "max_tokens": 512,
            "min_tokens": 0,
            "temperature": 0.01,
            "stop_sequences": "<|end_of_text|>",
            "prompt_template": "<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\nYou are a helpful assistant<|eot_id|><|start_header_id|>user<|end_header_id|>\n\n{prompt}<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n",
            "presence_penalty": 0,
            "frequency_penalty": 1,
        },
    ):
        print(str(event), end="")
    print()


go()
print("---------------------------------")
go()
print("---------------------------------")
go()
print("---------------------------------")

```

The latest release repeats the same token once for each previous
invocation.

<details>

```
Hi there! I'd be happy to help you with whatever you need. What's on your mind?assistant

I'm glad you asked! I'm here to help you with any questions or problems you might have. Whether it's something big or something small, I'm here to help.

So, what's on your mind? Do you have a specific question or problem you'd like to talk about?assistant

I'm all ears! Take your time, and feel free to share as much or as little as you'd like.

Remember, everything we discuss is confidential and just between us. So,
---------------------------------
Hi
Hi there! I'd be happy to help you with whatever you need. What's on your mind?assistant

I'm glad you asked! I'm here to help you with any questions or problems you might have. Whether it's something big or something small, I'm here to help.

So, what's on your mind? Do you have a specific question or problem you'd like to talk about?assistant

I'm all ears! Take your time, and feel free to share as much or as little as you'd like.

Remember, everything we discuss is confidential and just between us. So,
---------------------------------
Hi
Hi
Hi there! I'd be happy to help you with whatever you need. What's on your mind?assistant

I'm glad you asked! I'm here to help you with any questions or problems you might have. Whether it's something big or something small, I'm here to help.

So, what's on your mind? Do you have a specific question or problem you'd like to talk about?assistant

I'm all ears! Take your time, and feel free to share as much or as little as you'd like.

Remember, everything we discuss is confidential and just between us. So,
---------------------------------
```

</details>

This is caused by incorrect initialization of the stream `Decoder` object: the decoder was reused across streams, so leftover buffered state replayed at the start of the next stream.
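
For illustration only — this is not the library's actual code — one classic way a decoder ends up shared across calls in Python is a mutable default argument, which is evaluated once at function definition time:

```python
# Hypothetical sketch of the failure mode; names and structure are
# invented for illustration and do not match replicate-python's source.

class Decoder:
    """Toy incremental decoder: buffers partial data, yields complete lines."""

    def __init__(self) -> None:
        self.buffer = ""

    def decode(self, chunk: str):
        self.buffer += chunk
        while "\n" in self.buffer:
            line, self.buffer = self.buffer.split("\n", 1)
            yield line


# BUG: the default Decoder() is created once, when stream() is defined,
# so every call shares the same decoder and inherits its leftover buffer.
def stream(chunks, decoder=Decoder()):
    for chunk in chunks:
        yield from decoder.decode(chunk)


# FIX: construct a fresh decoder for each stream.
def stream_fixed(chunks, decoder=None):
    decoder = Decoder() if decoder is None else decoder
    for chunk in chunks:
        yield from decoder.decode(chunk)


print(list(stream(["Hi", "\nthere"])))  # ['Hi']          ('there' stays buffered)
print(list(stream(["Hello\n"])))        # ['thereHello']  state leaked across calls
```

After applying this change, the code produces the correct behavior: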

<details>

```
Hi there! I'd be happy to help you with whatever you need. What's on your mind?assistant

I'm glad you asked! I'm here to help you with any questions or problems you might have. Whether it's something big or something small, I'm here to help.

So, what's on your mind? Do you have a specific question or problem you'd like to talk about?assistant

I'm all ears! Take your time, and feel free to share as much or as little as you'd like.

Remember, everything we discuss is confidential and just between us. So,
---------------------------------
Hi there! I'd be happy to help you with whatever you need. What's on your mind?assistant

I'm glad you asked! I'm here to help you with any questions or problems you might have. Whether it's something big or something small, I'm here to help.

So, what's on your mind? Do you have a specific question or problem you'd like to talk about?assistant

I'm all ears! Take your time, and feel free to share as much or as little as you'd like.

Remember, everything we discuss is confidential and just between us. So,
---------------------------------
Hi there! I'd be happy to help you with whatever you need. What's on your mind?assistant

I'm glad you asked! I'm here to help you with any questions or problems you might have. Whether it's something big or something small, I'm here to help.

So, what's on your mind? Do you have a specific question or problem you'd like to talk about?assistant

I'm all ears! Take your time, and feel free to share as much or as little as you'd like.

Remember, everything we discuss is confidential and just between us. So,
---------------------------------

```

</details>

Signed-off-by: Mattt Zmuda <[email protected]>
mattt (Contributor) commented Apr 19, 2024

@mikutsky @Gusakovskyi Thanks again for reporting. This should be fixed by 0.25.2.

Please let me know if you continue to see this behavior.
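
Upgrading the client package, e.g. `pip install --upgrade replicate`, should pull in the fix.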

mikutsky (Author) commented:

> @mikutsky @Gusakovskyi Thanks again for reporting. This should be fixed by 0.25.2.
>
> Please let me know if you continue to see this behavior.

Thanks a lot! It works!
