Description
Please check that this issue hasn't been reported before.
- I searched previous Bug Reports and didn't find any similar reports.
Expected Behavior
Preprocessing with the debug flag should work:
python -m axolotl.cli.preprocess /content/test_axolotl.yaml --debug
Current behaviour
The command fails with an error. My dataset is a JSON file in which each example has the form {"text": <text_str>}.
I am doing pretraining with LoRA for a non-English language.
[2024-04-19 09:05:02,918] [DEBUG] [axolotl.log:61] [PID:2346] [RANK:0] max_input_len: 600
Dropping Long Sequences (num_proc=2): 100% 17/17 [00:00<00:00, 99.19 examples/s]
Add position_id column (Sample Packing) (num_proc=2): 100% 17/17 [00:00<00:00, 70.88 examples/s]
[2024-04-19 09:05:03,502] [INFO] [axolotl.load_tokenized_prepared_datasets:423] [PID:2346] [RANK:0] Saving merged prepared dataset to disk... /content/d538aae6e42c7df428d20d3ff2685ad0
Traceback (most recent call last):
  File "/usr/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/content/src/axolotl/src/axolotl/cli/preprocess.py", line 70, in <module>
    fire.Fire(do_cli)
  File "/usr/local/lib/python3.10/dist-packages/fire/core.py", line 143, in Fire
    component_trace = _Fire(component, args, parsed_flag_args, context, name)
  File "/usr/local/lib/python3.10/dist-packages/fire/core.py", line 477, in _Fire
    component, remaining_args = _CallAndUpdateTrace(
  File "/usr/local/lib/python3.10/dist-packages/fire/core.py", line 693, in _CallAndUpdateTrace
    component = fn(*varargs, **kwargs)
  File "/content/src/axolotl/src/axolotl/cli/preprocess.py", line 60, in do_cli
    load_datasets(cfg=parsed_cfg, cli_args=parsed_cli_args)
  File "/content/src/axolotl/src/axolotl/cli/__init__.py", line 397, in load_datasets
    train_dataset, eval_dataset, total_num_steps, prompters = prepare_dataset(
  File "/content/src/axolotl/src/axolotl/utils/data/sft.py", line 66, in prepare_dataset
    train_dataset, eval_dataset, prompters = load_prepare_datasets(
  File "/content/src/axolotl/src/axolotl/utils/data/sft.py", line 460, in load_prepare_datasets
    dataset, prompters = load_tokenized_prepared_datasets(
  File "/content/src/axolotl/src/axolotl/utils/data/sft.py", line 424, in load_tokenized_prepared_datasets
    dataset.save_to_disk(prepared_ds_path)
  File "/usr/local/lib/python3.10/dist-packages/datasets/arrow_dataset.py", line 1515, in save_to_disk
    fs, _ = url_to_fs(dataset_path, **(storage_options or {}))
  File "/usr/local/lib/python3.10/dist-packages/fsspec/core.py", line 363, in url_to_fs
    chain = _un_chain(url, kwargs)
  File "/usr/local/lib/python3.10/dist-packages/fsspec/core.py", line 316, in _un_chain
    if "::" in path
TypeError: argument of type 'PosixPath' is not iterable
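For triage: the TypeError reproduces without axolotl at all. fsspec's `_un_chain` does a substring check (`"::" in path`) that is only valid on strings, so any `pathlib.Path` value reaching `save_to_disk` trips it. A minimal sketch of the failure mode (not the actual axolotl code path; the path value is taken from the log above), suggesting that casting the prepared-dataset path to `str` before `save_to_disk` would likely avoid the crash:

```python
from pathlib import Path

# Any PosixPath triggers it; this value is copied from the log above.
prepared_ds_path = Path("/content/d538aae6e42c7df428d20d3ff2685ad0")

# Membership testing is not defined for Path objects, so the check
# fsspec performs raises TypeError on a PosixPath.
try:
    "::" in prepared_ds_path
    raised = False
except TypeError:
    raised = True
print("raises TypeError:", raised)

# Casting to str first makes the same check valid, which is why
# str(prepared_ds_path) before save_to_disk should work.
print("::" in str(prepared_ds_path))
```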
Steps to reproduce
Use a JSON file in which each example has the form {"text": <text_str>}.
Run preprocess with the debug flag:
python -m axolotl.cli.preprocess /content/test_axolotl.yaml --debug
The error above is raised.
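To make the repro concrete, here is how a stand-in for ./test_txt_data.json can be generated; I am assuming one JSON object per line (JSON Lines), and the example sentences are purely illustrative:

```python
import json

# Illustrative stand-in for test_txt_data.json: one {"text": ...}
# object per line, matching the "completion" dataset type with
# field: text in the config below.
examples = [
    {"text": "First example sentence."},
    {"text": "Second example sentence."},
]

with open("test_txt_data.json", "w", encoding="utf-8") as f:
    for ex in examples:
        f.write(json.dumps(ex, ensure_ascii=False) + "\n")
```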
Config yaml
base_model: google/gemma-7b
model_type: AutoModelForCausalLM
tokenizer_type: AutoTokenizer
load_in_8bit: false
load_in_4bit: false
strict: false
datasets:
- path: ./test_txt_data.json
type: completion
field: text
dataset_prepared_path: data/last_run_prepared
dataset_processes: 16
val_set_size: 0
output_dir: ./lora-out
adapter: lora
lora_model_dir:
gpu_memory_limit: 76
sequence_len: 1100
sample_packing: true
pad_to_sequence_len: true
lora_r: 64
lora_alpha: 128
lora_dropout: 0.05
lora_target_modules:
- q_proj
- v_proj
- k_proj
- o_proj
- gate_proj
- down_proj
- up_proj
lora_modules_to_save:
- embed_tokens
- lm_head
lora_target_linear: true
lora_fan_in_fan_out:
save_safetensors: True
gradient_accumulation_steps: 2
micro_batch_size: 10
num_epochs: 1
optimizer: adamw_torch
lr_scheduler: cosine
learning_rate: 0.0002
warmup_steps: 20
save_steps: 5000
train_on_inputs: false
group_by_length: false
bf16: auto
fp16:
tf32: false
gradient_checkpointing: true
early_stopping_patience:
resume_from_checkpoint:
local_rank:
logging_steps: 500
xformers_attention:
flash_attention: True
evals_per_epoch: 1
eval_table_size:
eval_max_new_tokens: 128
eval_sample_packing: False
debug:
deepspeed:
weight_decay: 0.0
fsdp:
fsdp_config:
special_tokens:
Possible solution
No response
Which Operating Systems are you using?
- Linux
- macOS
- Windows
Python Version
3.10
axolotl branch-commit
Latest
Acknowledgements
- My issue title is concise, descriptive, and in title casing.
- I have searched the existing issues to make sure this bug has not been reported yet.
- I am using the latest version of axolotl.
- I have provided enough information for the maintainers to reproduce and diagnose the issue.