
Preprocess with debug flag fails. #1544

Closed
@amitagh

Description


Please check that this issue hasn't been reported before.

  • I searched previous Bug Reports and didn't find any similar reports.

Expected Behavior

Preprocess with debug flag should work.
python -m axolotl.cli.preprocess /content/test_axolotl.yaml --debug

Current behaviour

It gives an error. I have a JSON file in which each example has the form {"text": <text_str>}.
I am doing pretraining with LoRA for a non-English language.

[2024-04-19 09:05:02,918] [DEBUG] [axolotl.log:61] [PID:2346] [RANK:0] max_input_len: 600
Dropping Long Sequences (num_proc=2): 100% 17/17 [00:00<00:00, 99.19 examples/s]
Add position_id column (Sample Packing) (num_proc=2): 100% 17/17 [00:00<00:00, 70.88 examples/s]
[2024-04-19 09:05:03,502] [INFO] [axolotl.load_tokenized_prepared_datasets:423] [PID:2346] [RANK:0] Saving merged prepared dataset to disk... /content/d538aae6e42c7df428d20d3ff2685ad0
Traceback (most recent call last):
File "/usr/lib/python3.10/runpy.py", line 196, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/usr/lib/python3.10/runpy.py", line 86, in _run_code
exec(code, run_globals)
File "/content/src/axolotl/src/axolotl/cli/preprocess.py", line 70, in <module>
fire.Fire(do_cli)
File "/usr/local/lib/python3.10/dist-packages/fire/core.py", line 143, in Fire
component_trace = _Fire(component, args, parsed_flag_args, context, name)
File "/usr/local/lib/python3.10/dist-packages/fire/core.py", line 477, in _Fire
component, remaining_args = _CallAndUpdateTrace(
File "/usr/local/lib/python3.10/dist-packages/fire/core.py", line 693, in _CallAndUpdateTrace
component = fn(*varargs, **kwargs)
File "/content/src/axolotl/src/axolotl/cli/preprocess.py", line 60, in do_cli
load_datasets(cfg=parsed_cfg, cli_args=parsed_cli_args)
File "/content/src/axolotl/src/axolotl/cli/init.py", line 397, in load_datasets
train_dataset, eval_dataset, total_num_steps, prompters = prepare_dataset(
File "/content/src/axolotl/src/axolotl/utils/data/sft.py", line 66, in prepare_dataset
train_dataset, eval_dataset, prompters = load_prepare_datasets(
File "/content/src/axolotl/src/axolotl/utils/data/sft.py", line 460, in load_prepare_datasets
dataset, prompters = load_tokenized_prepared_datasets(
File "/content/src/axolotl/src/axolotl/utils/data/sft.py", line 424, in load_tokenized_prepared_datasets
dataset.save_to_disk(prepared_ds_path)
File "/usr/local/lib/python3.10/dist-packages/datasets/arrow_dataset.py", line 1515, in save_to_disk
fs, _ = url_to_fs(dataset_path, **(storage_options or {}))
File "/usr/local/lib/python3.10/dist-packages/fsspec/core.py", line 363, in url_to_fs
chain = _un_chain(url, kwargs)
File "/usr/local/lib/python3.10/dist-packages/fsspec/core.py", line 316, in _un_chain
if "::" in path
TypeError: argument of type 'PosixPath' is not iterable
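For context, a minimal stdlib-only sketch of the failure mode (the path value here is just a stand-in): fsspec's `_un_chain` runs a `"::" in path` membership test, which only works on strings, so passing the prepared-dataset path as a `pathlib.PosixPath` raises. A likely fix, assuming nothing else depends on the `Path` object, would be converting it first, e.g. `dataset.save_to_disk(str(prepared_ds_path))`.

```python
from pathlib import Path

# Stand-in for axolotl's prepared_ds_path
prepared_ds_path = Path("/content/d538aae6e42c7df428d20d3ff2685ad0")

# fsspec's _un_chain performs a substring test like this, which raises
# TypeError for a PosixPath: it defines neither __contains__ nor __iter__.
try:
    "::" in prepared_ds_path
except TypeError as exc:
    print(exc)  # argument of type 'PosixPath' is not iterable

# Converting to str first behaves as fsspec expects:
assert "::" not in str(prepared_ds_path)
```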

Steps to reproduce

Use a JSON file in which each example has the form {"text": <text_str>}.

Preprocess with the debug flag:
python -m axolotl.cli.preprocess /content/test_axolotl.yaml --debug
But I get the error above.

Config yaml

base_model: google/gemma-7b
model_type: AutoModelForCausalLM
tokenizer_type: AutoTokenizer

load_in_8bit: false
load_in_4bit: false
strict: false

datasets:
  - path: ./test_txt_data.json
    type: completion
    field: text
dataset_prepared_path: data/last_run_prepared
dataset_processes: 16
val_set_size: 0
output_dir: ./lora-out

adapter: lora
lora_model_dir:

gpu_memory_limit: 76

sequence_len: 1100
sample_packing: true
pad_to_sequence_len: true

lora_r: 64
lora_alpha: 128
lora_dropout: 0.05
lora_target_modules: 
  - q_proj
  - v_proj
  - k_proj
  - o_proj
  - gate_proj
  - down_proj
  - up_proj
lora_modules_to_save:
  - embed_tokens
  - lm_head
lora_target_linear: true
lora_fan_in_fan_out:

save_safetensors: True

gradient_accumulation_steps: 2
micro_batch_size: 10
num_epochs: 1
optimizer: adamw_torch
lr_scheduler: cosine
learning_rate: 0.0002
warmup_steps: 20
save_steps: 5000

train_on_inputs: false
group_by_length: false
bf16: auto
fp16:
tf32: false

gradient_checkpointing: true
early_stopping_patience:
resume_from_checkpoint:
local_rank:
logging_steps: 500
xformers_attention:
flash_attention: True

evals_per_epoch: 1
eval_table_size:
eval_max_new_tokens: 128
eval_sample_packing: False
debug:
deepspeed:
weight_decay: 0.0
fsdp:
fsdp_config:
special_tokens:

Possible solution

No response

Which Operating Systems are you using?

  • Linux
  • macOS
  • Windows

Python Version

3.10

axolotl branch-commit

Latest

Acknowledgements

  • My issue title is concise, descriptive, and in title casing.
  • I have searched the existing issues to make sure this bug has not been reported yet.
  • I am using the latest version of axolotl.
  • I have provided enough information for the maintainers to reproduce and diagnose the issue.
