
Commit 7c565f6

update rl doc (#10522)
1 parent 3fa6ac6 commit 7c565f6

10 files changed: +159 −13 lines changed


docs/zh/llm/alignment/rl/README.md

+1

@@ -0,0 +1 @@
+../../../../../llm/alignment/rl/README.md

llm/README.md

+3 −3

@@ -186,7 +186,7 @@ python run_finetune.py ./config/qwen/pt_argument.json
 
 ### 3. Alignment
 
-We support preference alignment strategies such as DPO, KTO, and RLHF. The DPO and KTO strategies adopt the zero_padding strategy, combined with the FlashMask strategy, to effectively improve model training efficiency.
+We support preference alignment strategies such as DPO, KTO, and RL. The DPO and KTO strategies adopt the zero_padding strategy, combined with the FlashMask strategy, to effectively improve model training efficiency.
 
 #### 3.1 DPO
 
@@ -289,9 +289,9 @@ python -u -m paddle.distributed.launch --devices "0,1,2,3,4,5,6,7" ./alignment/
 python -u -m paddle.distributed.launch --devices "0,1,2,3,4,5,6,7" ./alignment/kto/run_kto.py ./config/llama/kto_lora_argument.json
 ```
 
-#### 3.3 RLHF
+#### 3.3 RL
 
-The PaddlePaddle large model suite provides code and complete usage examples for aligning LLMs with human preferences using the reinforcement learning PPO algorithm, supporting **3D distributed parallel training and generation acceleration with inference optimizations during the rollout stage**. See the [RLHF documentation](./docs/rlhf.md) for a detailed tutorial.
+The PaddlePaddle large model suite provides code and complete usage examples for aligning LLMs with human preferences using the reinforcement learning GRPO, Reinforce++, and PPO algorithms, supporting **3D distributed parallel training and generation acceleration with inference optimizations during the rollout stage**. See the [RL documentation](./alignment/rl/README.md) for a detailed tutorial.
 
 ### 4. Model Merging
 PaddleNLP supports multiple model merging methods, including **Linear, Slerp, Ties, DARE, and DELLA**, and supports flexible combinations of model parameter sparsification methods with model merging algorithms.

llm/alignment/ppo/README.md renamed to llm/alignment/rl/README.md

+36 −10
@@ -8,7 +8,7 @@ REINFORCE++ is an improved version of the classic REINFORCE algorithm that incorporates PPO's key…
 ## Environment Requirements
 
 * Training environment:
-1. Install PaddlePaddle-GPU by following the Paddle official site
+1. Install PaddlePaddle-GPU by following the [Paddle official site](https://www.paddlepaddle.org.cn/install/quick?docurl=/documentation/docs/zh/develop/install/pip/linux-pip.html); PaddlePaddle>=3.0 is required
 2. Clone and install PaddleNLP
 ```shell
 git clone https://github.com/PaddlePaddle/PaddleNLP.git
@@ -23,8 +23,6 @@ python setup_cuda.py install
 
 | Model Series | Model Names |
 |:-------------:|:------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
-| Llama3.1 | meta-llama/Meta-Llama-3.1-8B, meta-llama/Meta-Llama-3.1-8B-Instruct, meta-llama/Meta-Llama-3.1-70B, meta-llama/Meta-Llama-3.1-70B-Instruct, meta-llama/Meta-Llama-3.1-405B, meta-llama/Meta-Llama-3.1-405B-Instruct, meta-llama/Llama-Guard-3-8B |
-| Llama3.2 | meta-llama/Llama-3.2-1B, meta-llama/Llama-3.2-1B-Instruct, meta-llama/Llama-3.2-3B, meta-llama/Llama-3.2-3B-Instruct |
 | Qwen1.5 | Qwen/Qwen1.5-0.5B, Qwen/Qwen1.5-0.5B-Chat, Qwen/Qwen1.5-1.8B, Qwen/Qwen1.5-1.8B-Chat, Qwen/Qwen1.5-4B, Qwen/Qwen1.5-4B-Chat, Qwen/Qwen1.5-7B, Qwen/Qwen1.5-7B-Chat, Qwen/Qwen1.5-14B, Qwen/Qwen1.5-14B-Chat, Qwen/Qwen1.5-32B, Qwen/Qwen1.5-32B-Chat |
 | Qwen2 | Qwen/Qwen2-0.5B, Qwen/Qwen2-0.5B-Instruct, Qwen/Qwen2-1.5B, Qwen/Qwen2-1.5B-Instruct, Qwen/Qwen2-7B, Qwen/Qwen2-7B-Instruct, Qwen/Qwen2-72B, Qwen/Qwen2-72B-Instruct, Qwen/Qwen2-57B-A14B, Qwen/Qwen2-57B-A14B-Instruct |
 | Qwen2-Math | Qwen/Qwen2-Math-1.5B, Qwen/Qwen2-Math-1.5B-Instruct, Qwen/Qwen2-Math-7B, Qwen/Qwen2-Math-7B-Instruct |
@@ -36,7 +34,7 @@ python setup_cuda.py install
 
 ### Field Descriptions
 
-- src (list(str)): the prompt input after chat_template processing;
+- src (list(str)): the prompt input after chat_template processing, or a prompt you assemble yourself as needed;
 - tgt (list(str)): the label content;
 
 ### Data Example
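The README's own data example follows at this point in the renamed file. For orientation only, below is a minimal sketch of how a record with the `src`/`tgt` fields described above could be assembled and written to JSONL; the prompt and label strings are placeholders rather than actual ppo-kk content, and the chat-template markers are purely illustrative.

```python
# Hypothetical illustration of the "src"/"tgt" record layout described above.
# The actual ppo-kk records may differ; the README's own data example is authoritative.
import json

record = {
    "src": ["<|im_start|>user\nSolve the puzzle ...<|im_end|>\n<|im_start|>assistant\n"],  # prompt after chat_template (or hand-built)
    "tgt": ["expected answer text"],  # label content consumed by the reward function
}

with open("train.jsonl", "a", encoding="utf-8") as f:
    f.write(json.dumps(record, ensure_ascii=False) + "\n")
```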
@@ -59,7 +57,7 @@ wget https://paddlenlp.bj.bcebos.com/datasets/examples/ppo-kk.tgz && tar zxf ppo
 
 ### GRPO && REINFORCE++ Training Configuration
 
-The configuration files we use are placed in `llm/config/llama/grpo_argument.yaml` and `llm/config/qwen/grpo_argument.yaml`, and detailed parameter descriptions are provided below:
+The configuration file we use is placed in `llm/config/qwen/grpo_argument.yaml`, and detailed parameter descriptions are provided below:
 - `rl_algorithm`: the reinforcement learning algorithm to use; `grpo` and `reinforce_plus_plus` are supported
 - `actor_model_name_or_path`: local model path of the actor model and reference model
 - `reward_model_name_or_path`: name or local path of the reward model
@@ -144,7 +142,7 @@ max_dec_len + max_prompt_len should be smaller than max_seq_len.
 
 ### GRPO Training Command
 ```shell
-cd your_PaddleNLP_path/llm/alignment/ppo
+cd your_PaddleNLP_path/llm/alignment/rl
 ```
 
 ```shell
@@ -167,15 +165,43 @@ export FLAGS_mla_use_tensorcore=0
 export FLAGS_cascade_attention_max_partition_size=2048
 
 python -u -m paddle.distributed.launch --devices "0,1,2,3" run_ppo.py ../../config/qwen/grpo_argument.yaml
-# python -u -m paddle.distributed.launch --devices "0,1,2,3" run_ppo.py ../../config/llama/grpo_argument.yaml
 ```
+We provide a [wandb log](https://api.wandb.ai/links/junyu/5jiulhem) that can be reproduced with the script above.
 
-### REINFORCE++ Training Command
-Simply change `rl_algorithm` in the configuration file `grpo_argument.yaml` to `reinforce_plus_plus`; the other commands are the same as for GRPO.
 
+### Reinforce++ Training Command
+```shell
+cd your_PaddleNLP_path/llm/alignment/rl
+```
+
+```shell
+# Start the reward server
+python reward_server.py
+```
+
+```shell
+export PYTHONPATH=your_PaddleNLP_path/:$PYTHONPATH
+export PYTHONPATH=your_PaddleNLP_path/llm:$PYTHONPATH
+
+export FLAGS_set_to_1d=False
+export NVIDIA_TF32_OVERRIDE=0
+export FLAGS_dataloader_use_file_descriptor=False
+export HF_DATASETS_DOWNLOAD_TIMEOUT=1
+export FLAGS_gemm_use_half_precision_compute_type=False
+export FLAGS_force_cublaslt_no_reduced_precision_reduction=True
+
+export FLAGS_mla_use_tensorcore=0
+export FLAGS_cascade_attention_max_partition_size=2048
+
+python -u -m paddle.distributed.launch --devices "0,1,2,3" run_ppo.py ../../config/qwen/reinforce_plus_plus_argument.yaml
+```
+
+We provide a [wandb log](https://api.wandb.ai/links/ainlp66-netflix/ps1dpaxm) that can be reproduced with the script above.
 
 ### Online Monitoring
-The output directory set in `grpo_argument.yaml` is `"logging_dir": "vdl_log"`; the training process can be viewed with the following command
+The output directory set in `grpo_argument.yaml` and `reinforce_plus_plus_argument.yaml` is `"logging_dir": "vdl_log"`; the training process can be viewed with the following command
 ```shell
 visualdl --logdir vdl_log --host 0.0.0.0
 ```
+
+Monitoring with wandb and other tools is also supported: set `"logging_dir": "wandb"`; the wandb dependency must be installed and you must be logged in beforehand.
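The Reinforce++ recipe above starts a reward service with `python reward_server.py`, and the new config points the trainer at it via `use_rm_server: true` and `reward_server: "http://127.0.0.1:8731"`. The request/response schema of PaddleNLP's actual `reward_server.py` is not shown in this diff; the standalone sketch below only illustrates the general shape of such an HTTP reward endpoint, with a made-up JSON contract (`src`/`tgt`/`response` in, `score` out) that should not be mistaken for the real one.

```python
# Hypothetical sketch of an HTTP reward service; the real reward_server.py in
# PaddleNLP may use a different route, payload layout, and scoring rule.
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


def score(prompt: str, label: str, response: str) -> float:
    """Toy rule-based reward: 1.0 if the reference answer appears in the response."""
    return 1.0 if label.strip() and label.strip() in response else 0.0


class RewardHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        # Assumed payload layout: {"src": [...], "tgt": [...], "response": [...]}
        body = json.loads(self.rfile.read(int(self.headers["Content-Length"])))
        rewards = [
            score(p, t, r)
            for p, t, r in zip(body["src"], body["tgt"], body["response"])
        ]
        payload = json.dumps({"score": rewards}).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(payload)))
        self.end_headers()
        self.wfile.write(payload)


if __name__ == "__main__":
    # Port matches the reward_server address used in the YAML config (8731).
    HTTPServer(("0.0.0.0", 8731), RewardHandler).serve_forever()
```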
File renamed without changes.
@@ -0,0 +1,119 @@
+# RL algorithms
+rl_algorithm: "reinforce_plus_plus" # The reinforcement learning algorithm used, supported: "ppo", "grpo", "reinforce_plus_plus"
+
+# models
+actor_model_name_or_path: "Qwen/Qwen2.5-7B-Instruct-1M" # The name or path of the actor model
+reward_model_name_or_path: "" # The name or path of the reward model
+use_rm_server: true # Whether to use the reward model server
+reward_server: "http://127.0.0.1:8731" # The address of the reward model server
+
+# logging
+logging_dir: grpo-logs # Directory for logging
+logging_steps: 1 # Number of steps between logging
+output_dir: "qwen2.5-7b-kk-dataset-grpo/checkpoints" # Directory for output ckpts
+report_to: "visualdl" # Supported reporting options: "all", "wandb", "tensorboard", "visualdl" (default), "none"
+wandb_http_proxy: "http://127.0.0.1:8962" # HTTP proxy for wandb
+run_name: "qwen2.5-7b-kk-dataset-grpo" # Name of the run
+
+# data
+train_datasets: "ppo-kk/34567ppl/train.jsonl" # Path to the training dataset
+eval_datasets: "ppo-kk/5ppl/test.jsonl" # Path to the evaluation dataset
+prompt_key: "src" # Key for the prompt in the dataset
+response_key: "tgt" # Key for the response in the dataset
+dataloader_drop_last: true # Whether to drop the last incomplete batch in the DataLoader
+dataloader_shuffle: false # Whether to shuffle the train dataset
+balance_batch: true # Whether to balance batch size across dataset_world_size
+use_remove_padding: true # Whether to remove padding tokens in the input
+
+# distributed training args
+tensor_parallel_degree: 2 # Degree of tensor parallelism
+sequence_parallel: true # Whether to enable sequence parallelism
+sharding_parallel_degree: -1 # Degree of sharding parallelism
+sharding: "stage1" # Sharding strategy, e.g., "stage1" or "stage2"
+sharding_parallel_config: "enable_release_grads" # Configuration for sharding parallelism
+pipeline_parallel_degree: 1 # Degree of pipeline parallelism
+virtual_pp_degree: 1 # Degree of virtual pipeline parallelism
+
+# rollout args
+max_prompt_len: 512 # Maximum length of the prompt; longer prompts are automatically truncated
+max_dec_len: 4096 # Maximum length of the response
+min_dec_len: 32 # Minimum length of the response
+top_p: 1.0 # Top-p sampling parameter
+temperature: 0.7 # Temperature parameter for sampling
+repetition_penalty: 1.0 # Repetition penalty parameter
+rollout_max_num_seqs: 32 # The maximum number of sequences that can be processed in a single inference
+rollout_quant_type: "" # Quantization type, e.g., "weight_only_int8"
+
+# training args
+do_train: true # Whether to perform training
+seed: 42 # Random seed for reproducibility
+global_batch_size: 8 # Global batch size for training
+global_gen_batch_size: -1 # Global generation batch size for dynamic sampling
+global_mini_batch_size: -1 # Mini-batch size for training
+rollout_n: 8 # Number of rollouts
+update_iters: 1 # Number of training iterations for rollout samples
+per_device_logprob_batch_size: 8 # Log probability batch size per device
+per_device_reward_batch_size: 8 # Reward batch size per device
+per_device_value_batch_size: 8 # Value batch size per device
+per_device_train_batch_size: 8 # Training batch size per device
+# gradient_accumulation_steps: 1 # Gradient accumulation steps (auto-calculated)
+num_train_epochs: 6 # Number of training epochs
+max_length: 4608 # Maximum length for training, should be larger than max_prompt_len + max_dec_len
+learning_rate: 5e-7 # Learning rate for training
+lr_scheduler_type: "constant" # Learning rate scheduler type
+weight_decay: 1e-2 # Weight decay for the AdamW optimizer
+adam_beta1: 0.9 # AdamW optimizer beta1
+adam_beta2: 0.999 # AdamW optimizer beta2
+adam_epsilon: 1e-8 # AdamW optimizer epsilon
+max_grad_norm: 1.0 # Maximum gradient norm for clipping
+max_steps: 3600 # Maximum number of training steps
+save_steps: 300 # Number of steps between model saves
+save_strategy: "steps" # Strategy for saving models
+ignore_save_lr_and_optim: true # Whether to skip saving the learning rate and optimizer state (leave empty if not specified)
+disable_tqdm: true # Whether to disable the tqdm progress bar
+
+# RL args
+kl_coeff: 0.0 # KL coefficient
+kl_loss_coeff: 0.000 # KL loss coefficient
+pg_loss_coeff: 1.0 # Policy gradient loss coefficient
+entropy_coeff: 0.0 # Entropy coefficient
+clip_range_ratio: 0.2 # The clipping range for the ratio between the old and new policy (PPO algorithm)
+clip_range_ratio_low: 0.2 # Lower clipping range for the ratio between the old and new policy (PPO algorithm)
+clip_range_ratio_high: 0.2 # Upper clipping range for the ratio between the old and new policy (PPO algorithm)
+clip_range_score: 10.0 # The clipping range for the output of the score model. The reward is clipped into [-clip_range_score, clip_range_score].
+enable_overlong_reward_buffer: false # Whether to enable the overlong reward buffer
+overlong_reward_buffer: 256 # The length of the overlong reward buffer
+overlong_penalty_factor: 1.0 # The penalty factor for the overlong reward buffer
+clip_range_value: 5.0 # The clipping range for the output of the value model. The value is clipped into [-clip_range_value, clip_range_value].
+normalize_reward: false # Whether to normalize the reward
+normalize_advantage: false # Whether to normalize the advantage
+dynamic_sampling: false # Whether to use dynamic sampling, which is introduced in the DAPO algorithm https://arxiv.org/abs/2503.14476
+max_gen_batches: 2 # Maximum number of generation batches for dynamic sampling
+use_fp32_compute: true # Whether to use fp32 to compute xx_log_prob, rewards, advantages and loss
+
+# eval args
+do_eval: true # Whether to perform evaluation
+per_device_eval_batch_size: 32 # Evaluation batch size per device
+evaluation_strategy: "steps" # Evaluation strategy, e.g., "steps"
+eval_steps: 20 # Number of steps between evaluations
+
+# device memory optimization args
+use_flash_attention: true # Whether to use fused attention operations
+use_fused_rms_norm: false # Whether to use fused RMS norm operations, which requires installing fused_ln from slm/model_zoo/gpt-3/external_ops
+use_fused_rope: false # Whether to use fused rope operations
+use_fused_head_and_loss_fn: true # Whether to use the fused head and loss function
+use_fused_linear: true # Whether to use fused linear operations
+recompute: true # Whether to enable gradient checkpointing for memory optimization
+recompute_use_reentrant: true # Whether to use reentrant recompute
+recompute_granularity: "full" # Granularity of recompute
+bf16: true # Whether to use mixed precision with bfloat16
+fp16_opt_level: "O2" # Optimization level for fp16 and bf16 training
+amp_master_grad: false # Whether to use float32 weight gradients for master weights with amp opt level 'O2'
+amp_custom_black_list: "reduce_sum softmax_with_cross_entropy c_softmax_with_cross_entropy elementwise_div sin cos" # Custom black list for amp
+amp_custom_white_list: "lookup_table lookup_table_v2 flash_attn matmul matmul_v2 fused_gemm_epilogue" # Custom white list for amp
+offload_level: "freeze_model" # Level of model offloading to pinned memory, supported values: freeze_model, train_model, optimizer
+release_grads: true # Whether to release gradients
+offload_optim: false # Whether to offload the optimizer to pinned memory
+
+# benchmark args
+skip_profile_timer: false # Whether to skip the profiling timer
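As a quick sanity check on the length-related fields in a config like the one above, the sketch below (assuming PyYAML is installed and the YAML is saved locally under a hypothetical name such as `rl_argument.yaml`) verifies the budget stated in the README, i.e. that max_prompt_len + max_dec_len fits within the training max_length; with the values above, 512 + 4096 = 4608, which matches max_length exactly.

```python
# Minimal sketch: load an RL YAML config and check the length budget that the
# README calls out (max_prompt_len + max_dec_len should fit within max_length).
# Assumes PyYAML is installed and the config is saved as rl_argument.yaml.
import yaml

with open("rl_argument.yaml", "r", encoding="utf-8") as f:
    cfg = yaml.safe_load(f)

prompt_budget = cfg["max_prompt_len"] + cfg["max_dec_len"]  # 512 + 4096 = 4608 here
assert prompt_budget <= cfg["max_length"], (
    f"max_prompt_len + max_dec_len = {prompt_budget} exceeds max_length = {cfg['max_length']}"
)
assert cfg["rl_algorithm"] in {"ppo", "grpo", "reinforce_plus_plus"}, cfg["rl_algorithm"]
print("length budget OK:", prompt_budget, "<=", cfg["max_length"])
```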
