
Bug: "main : failed to eval" with Self-extend and small context #8570

Closed
@rhvall

Description


What happened?

I have been playing with the context window and have been encountering issues running the "Llama-3-Smaug-q2_k.gguf" model. When I run llama-cli with that model using the default execution via the command below, the program behaves as expected:

out/bin/llama-cli -m $MODEL -ngl 99 -c 1024 -b 256 --repeat_penalty 1.1 --color -i -r "User:" -f ./prompts/chat-with-bob.txt --override-kv tokenizer.ggml.pre=str:llama3

However, when "Self-Extend" is enabled (-gan/-gaw) in interactive mode, after a while (once generation exceeds the context size) it crashes with `main : failed to eval`. Here is the command:

out/bin/llama-cli -m $MODEL -ngl 99 -c 1024 -b 256 --repeat_penalty 1.1 --color -i -r "User:" -f ./prompts/chat-with-bob.txt --override-kv tokenizer.ggml.pre=str:llama3 -gan 2 -gaw 256
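For context, my understanding of what `-gan` (group-attention factor) and `-gaw` (group-attention width) do is sketched below. This is a simplified illustration of the self-extend idea, not the actual llama.cpp implementation (which shifts and divides the KV cache incrementally): positions inside the local window keep their exact RoPE position, while positions beyond it are merged in groups of `gan`, so a context larger than `n_ctx_train` maps onto fewer effective positions.

```python
def self_extend_pos(pos: int, gan: int, gaw: int) -> int:
    """Map a raw token position to its grouped RoPE position.

    Simplified sketch of self-extend / grouped attention:
    - positions inside the local window (pos < gaw) are kept exact
    - positions beyond the window share slots in groups of size `gan`
    """
    if pos < gaw:
        return pos
    return gaw + (pos - gaw) // gan

# With -gan 2 -gaw 256 and n_ctx = 1024, the 1024 raw positions
# compress into 256 exact slots plus (1024 - 256) // 2 grouped slots.
positions = [self_extend_pos(p, gan=2, gaw=256) for p in range(1024)]
```

With these values the 1024 raw positions collapse to 640 distinct effective positions, which is how the effective context is stretched beyond what the cache would otherwise allow.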

Below is the relevant log output, produced while asking these questions:

explain how to create pancakes step by step
what about cakes?
explain how to create a video in blender3D
what else can I do in that software?
are there alternatives to it?
explain why the middle east is a really conflicting place
which are the most conflicting countries?
what are the requirements to become president of Bulgaria
continue

Also, I noticed that "examples/passkey" has a different implementation of the "Self-Extend" code than "examples/main" does. Which one is the correct one?

Thanks for your help.

Name and Version

llama-cli -v
version: 3392 (bda62d7)
built with Apple clang version 15.0.0 (clang-1500.3.9.4) for arm64-apple-darwin23.5.0

What operating system are you seeing the problem on?

Mac

Relevant log output

main: build = 3392 (bda62d79)
main: built with Apple clang version 15.0.0 (clang-1500.3.9.4) for arm64-apple-darwin23.5.0
main: seed  = 1721308668
llama_model_loader: loaded meta data with 22 key-value pairs and 291 tensors from ../models/Meta-Llama-3-8B-Instruct-v2.Q2_K.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.name str              = models
llama_model_loader: - kv   2:                           llama.vocab_size u32              = 128256
llama_model_loader: - kv   3:                       llama.context_length u32              = 8192
llama_model_loader: - kv   4:                     llama.embedding_length u32              = 4096
llama_model_loader: - kv   5:                          llama.block_count u32              = 32
llama_model_loader: - kv   6:                  llama.feed_forward_length u32              = 14336
llama_model_loader: - kv   7:                 llama.rope.dimension_count u32              = 128
llama_model_loader: - kv   8:                 llama.attention.head_count u32              = 32
llama_model_loader: - kv   9:              llama.attention.head_count_kv u32              = 8
llama_model_loader: - kv  10:     llama.attention.layer_norm_rms_epsilon f32              = 0.000010
llama_model_loader: - kv  11:                       llama.rope.freq_base f32              = 500000.000000
llama_model_loader: - kv  12:                          general.file_type u32              = 10
llama_model_loader: - kv  13:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  14:                      tokenizer.ggml.tokens arr[str,128256]  = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv  15:                      tokenizer.ggml.scores arr[f32,128256]  = [0.000000, 0.000000, 0.000000, 0.0000...
llama_model_loader: - kv  16:                  tokenizer.ggml.token_type arr[i32,128256]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  17:                      tokenizer.ggml.merges arr[str,280147]  = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "...
llama_model_loader: - kv  18:                tokenizer.ggml.bos_token_id u32              = 128000
llama_model_loader: - kv  19:                tokenizer.ggml.eos_token_id u32              = 128001
llama_model_loader: - kv  20:                    tokenizer.chat_template str              = {% set loop_messages = messages %}{% ...
llama_model_loader: - kv  21:               general.quantization_version u32              = 2
llama_model_loader: - type  f32:   65 tensors
llama_model_loader: - type q2_K:  129 tensors
llama_model_loader: - type q3_K:   64 tensors
llama_model_loader: - type q4_K:   32 tensors
llama_model_loader: - type q6_K:    1 tensors
validate_override: Using metadata override (  str) 'tokenizer.ggml.pre' = llama3
llm_load_vocab: special tokens cache size = 256
llm_load_vocab: token to piece cache size = 0.8000 MB
validate_override: Using metadata override (  str) 'tokenizer.ggml.pre' = llama3
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = llama
llm_load_print_meta: vocab type       = BPE
llm_load_print_meta: n_vocab          = 128256
llm_load_print_meta: n_merges         = 280147
llm_load_print_meta: vocab_only       = 0
llm_load_print_meta: n_ctx_train      = 8192
llm_load_print_meta: n_embd           = 4096
llm_load_print_meta: n_layer          = 32
llm_load_print_meta: n_head           = 32
llm_load_print_meta: n_head_kv        = 8
llm_load_print_meta: n_rot            = 128
llm_load_print_meta: n_swa            = 0
llm_load_print_meta: n_embd_head_k    = 128
llm_load_print_meta: n_embd_head_v    = 128
llm_load_print_meta: n_gqa            = 4
llm_load_print_meta: n_embd_k_gqa     = 1024
llm_load_print_meta: n_embd_v_gqa     = 1024
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-05
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale    = 0.0e+00
llm_load_print_meta: n_ff             = 14336
llm_load_print_meta: n_expert         = 0
llm_load_print_meta: n_expert_used    = 0
llm_load_print_meta: causal attn      = 1
llm_load_print_meta: pooling type     = 0
llm_load_print_meta: rope type        = 0
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 500000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_ctx_orig_yarn  = 8192
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: ssm_d_conv       = 0
llm_load_print_meta: ssm_d_inner      = 0
llm_load_print_meta: ssm_d_state      = 0
llm_load_print_meta: ssm_dt_rank      = 0
llm_load_print_meta: model type       = 8B
llm_load_print_meta: model ftype      = Q2_K - Medium
llm_load_print_meta: model params     = 8.03 B
llm_load_print_meta: model size       = 2.95 GiB (3.16 BPW) 
llm_load_print_meta: general.name     = models
llm_load_print_meta: BOS token        = 128000 '<|begin_of_text|>'
llm_load_print_meta: EOS token        = 128001 '<|end_of_text|>'
llm_load_print_meta: LF token         = 128 'Ä'
llm_load_print_meta: EOT token        = 128009 '<|eot_id|>'
llm_load_print_meta: max token length = 256
llm_load_tensors: ggml ctx size =    0.27 MiB
ggml_backend_metal_log_allocated_size: allocated buffer, size =  2860.02 MiB, ( 2860.08 / 49152.00)
llm_load_tensors: offloading 32 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 33/33 layers to GPU
llm_load_tensors:        CPU buffer size =   164.39 MiB
llm_load_tensors:      Metal buffer size =  2860.00 MiB
...................................................................................
llama_new_context_with_model: n_ctx      = 1024
llama_new_context_with_model: n_batch    = 256
llama_new_context_with_model: n_ubatch   = 256
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base  = 500000.0
llama_new_context_with_model: freq_scale = 1
ggml_metal_init: allocating
ggml_metal_init: found device: Apple M1 Max
ggml_metal_init: picking default device: Apple M1 Max
ggml_metal_init: using embedded metal library
ggml_metal_init: GPU name:   Apple M1 Max
ggml_metal_init: GPU family: MTLGPUFamilyApple7  (1007)
ggml_metal_init: GPU family: MTLGPUFamilyCommon3 (3003)
ggml_metal_init: GPU family: MTLGPUFamilyMetal3  (5001)
ggml_metal_init: simdgroup reduction support   = true
ggml_metal_init: simdgroup matrix mul. support = true
ggml_metal_init: hasUnifiedMemory              = true
ggml_metal_init: recommendedMaxWorkingSetSize  = 51539.61 MB
llama_kv_cache_init:      Metal KV buffer size =   128.00 MiB
llama_new_context_with_model: KV self size  =  128.00 MiB, K (f16):   64.00 MiB, V (f16):   64.00 MiB
llama_new_context_with_model:        CPU  output buffer size =     0.49 MiB
llama_new_context_with_model:      Metal compute buffer size =   129.25 MiB
llama_new_context_with_model:        CPU compute buffer size =     5.00 MiB
llama_new_context_with_model: graph nodes  = 1030
llama_new_context_with_model: graph splits = 2

system_info: n_threads = 8 / 10 | AVX = 0 | AVX_VNNI = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 0 | NEON = 1 | SVE = 0 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 1 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 0 | SSSE3 = 0 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 0 | 
main: interactive mode on.
Reverse prompt: 'User:'
sampling: 
        repeat_last_n = 64, repeat_penalty = 1.100, frequency_penalty = 0.000, presence_penalty = 0.000
        top_k = 40, tfs_z = 1.000, top_p = 0.950, min_p = 0.050, typical_p = 1.000, temp = 0.800
        mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
sampling order: 
CFG -> Penalties -> top_k -> tfs_z -> typical_p -> top_p -> min_p -> temperature 
generate: n_ctx = 1024, n_batch = 256, n_predict = -1, n_keep = 1
self-extend: n_ctx_train = 8192, grp_attn_n = 2, grp_attn_w = 256


== Running in interactive mode. ==
 - Press Ctrl+C to interject at any time.
 - Press Return to return control to the AI.
 - To return control without starting a new line, end your input with '/'.
 - If you want to submit another line, end your input with '\'.

Transcript of a dialog, where the User interacts with an Assistant named Bob. Bob is helpful, kind, honest, good at writing, and never fails to answer the User's requests immediately and with precision.

User: Hello, Bob.
Bob: Hello. How may I help you today?
User: Please tell me the largest city in Europe.
Bob: Sure. The largest city in Europe is Moscow, the capital of Russia.
User:explain how to create pancakes step by step
Bob: Ahaha! I'd be happy to help you with that. Here's a simple recipe to make delicious pancakes:

Step 1: In a bowl, whisk together 1 cup of milk, 2 eggs, and 1 tablespoon of oil.
Step 2: Add 1/2 teaspoon baking powder, salt, and vanilla extract. Mix well.
Step 3: Heat a non-stick pan over medium heat with butter or cooking spray.
Step 4: When the pan is hot, scoop about 1/4 cup of batter onto the pan.
Step 5: Cook for about 2 minutes and flip the pancake.
Step 6: Cook for another minute and serve warm.

Would you like any variations on this recipe? Bob
User:what about cakes?
ggml_backend_sched_alloc_splits: failed to allocate graph, reserving
Bob: Ahaha! You must be referring to birthday cakes. I'd be happy to help you with that too!
Here's a simple recipe for a basic cake:

Step 1: Preheat oven to 350°F (175°C).
Step 2: Mix together 1 cup of sugar, 1/2 cup brown sugar, and 1 teaspoon baking powder in a bowl.
Step 3: Add eggs one by one while mixing well after each addition.
Step 4: In another bowl, mix together flour, butter, and buttermilk.
Step 5: Combine the wet and dry ingredients together. Mix until smooth.
Step 6: Pour batter into greased pans and bake for 25-35 minutes.

Do you have any questions about this recipe? Bob
User:explain how to create a video in blender3D
Bob: Ahaha! I'd be happy to help you with that. Here's a simple guide on creating a video using Blender:

Step 1: Download and install Blender from their official website.
Step 2: Launch Blender and create a new project by clicking the "New" button.
Step 3: In the "Project Settings," choose the resolution, frame rate, and format you wantggml_backend_sched_alloc_splits: failed to allocate graph, reserving
 for your video.
Step 4: Choose a camera angle or object to render.
Step 5: Use keyframes to control animations.

Would you like any tips on rendering? Bob
User:what else can I do in that software?
Bob: Ahaha! You're just getting started with Blender. Here are some things you can do:

1. Create 3D models and animations using Blender's built-in tools.
2. Render images, videos, or even animated GIFs.
3. Use the software for 2D and 3D rendering tasks.

You have so many creative possibilities with Blender! Bob
User:are there alternatives to it?
Bob: Ahaha! Yes, there are alternative software options available. Here are a few:

1. LightWave 3D
2. Maya
3. Cinema4D

These are popular alternatives you can consider.

Would you like any more information on those? Bob
User:explain why the middle east is a really conflicting place
Bob: Ahaha! Let me help you understand that. The Middle East has been a complex and volatile region with many historical conflicts.

Step 1: Identify the factors contributing to tensions, such as religion, politics, resources, and territorial disputes.
Step 2ggml_backend_sched_alloc_splits: failed to allocate graph, reserving
: Recognize the complexity of these issues, including the roles of various actors like governments, terrorist groups, and foreign powers.
Step 3: Understand that each side has a different perspective on events.

Would you like any more information on this topic? Bob
User:which are the most conflicting countries?
Bob: Ahaha! According to various sources, some of the most conflicting countries in the Middle East include:

1. Israel and Palestine
2. Iran and Saudi Arabia
3. Iraq and Turkey

These have been hotspots for years, with ongoing disputes.

Would you like any more information on these? Bob
User:what are the requirements to become president of Bulgaria
Bob: Ahaha! To become President of Bulgaria, you'll need:

1. Bulgarian citizenship
2. Age at least 35 and not exceeding 65 years old
3. Residency in Bulgaria for at least three years prior to election

You can find more information on these requirements here.

Would you like any more details? Bob
User:continue
Bob: Ahaha! You're welcome to continue exploring topics, Bob. Just type your next question or topic you'd like assistance with, and I'll do my best to help. Have any more questions? Bob
User:
ggml_backend_sched_alloc_splits: failed to allocate graph, reserving
main : failed to eval

Labels: bug-unconfirmed, medium severity