Configurations

Note

This page serves as a reference manual for the configuration objects, i.e., you can check which attributes can be modified and their default values. You don’t need to read through this page before running experiments!

Please check the Quickstart and Customization sections for concrete examples of running experiments.

We illustrate configurations for quickstart experiments on this page. Each type of experiment (e.g., SFT, PPO) corresponds to a specific configuration object (e.g., realhf.SFTConfig for SFT).

Because ReaL uses Hydra for configuration management, every option defined by these classes, including fields of nested configuration objects, can be overridden from the command line.

Experiment Configurations

class realhf.ExperimentSaveEvalControl(total_train_epochs: int = 1, save_freq_epochs: int | None = None, save_freq_steps: int | None = None, save_freq_secs: int | None = None, eval_freq_epochs: int | None = None, eval_freq_steps: int | None = None, eval_freq_secs: int | None = None, benchmark_steps: int | None = None)

Utility object for controlling the frequency of saving and evaluation during training.

Epoch refers to the number of times the training loop iterates over the entire dataset. Step refers to the number of iterations running the algorithm dataflow.

This object manages independent counters for epochs, steps, and seconds. The model is saved or evaluated whenever any of the frequency conditions below is met.

Parameters:
  • total_train_epochs (int) – The total number of epochs to train the model.

  • save_freq_epochs (Optional[int]) – Frequency in epochs at which to save the model. If None, the model will not be saved based on epoch changes during training.

  • save_freq_steps (Optional[int]) – Frequency in steps at which to save the model. If None, the model will not be saved based on step changes during training.

  • save_freq_secs (Optional[int]) – Frequency in seconds at which to save the model. If None, the model will not be saved based on time changes during training.

  • eval_freq_epochs (Optional[int]) – Frequency in epochs at which to evaluate the model. If None, the model will not be evaluated based on epoch changes during training.

  • eval_freq_steps (Optional[int]) – Frequency in steps at which to evaluate the model. If None, the model will not be evaluated based on step changes during training.

  • eval_freq_secs (Optional[int]) – Frequency in seconds at which to evaluate the model. If None, the model will not be evaluated based on time changes during training.

  • benchmark_steps (Optional[int]) – Terminate training after this number of steps. Used for system benchmarking only. Set to None for normal training.
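
The "any condition triggers" behavior described above can be sketched as follows. This is a minimal illustration, not ReaL's implementation: the FreqTrigger class and its check method are hypothetical names introduced here, and the reset policy (each counter resets only when it fires) is an assumption.

```python
import time
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class FreqTrigger:
    """Hypothetical helper: fires when ANY configured frequency is reached."""

    freq_epochs: Optional[int] = None
    freq_steps: Optional[int] = None
    freq_secs: Optional[int] = None
    _epochs: int = 0
    _steps: int = 0
    _last_time: float = field(default_factory=time.monotonic)

    def check(self, epochs: int = 0, steps: int = 0) -> bool:
        """Advance the counters and report whether to save/evaluate now."""
        self._epochs += epochs
        self._steps += steps
        fired = False
        # Each counter is independent; a frequency of None never fires.
        if self.freq_epochs is not None and self._epochs >= self.freq_epochs:
            self._epochs = 0
            fired = True
        if self.freq_steps is not None and self._steps >= self.freq_steps:
            self._steps = 0
            fired = True
        now = time.monotonic()
        if self.freq_secs is not None and now - self._last_time >= self.freq_secs:
            self._last_time = now
            fired = True
        return fired


save_ctl = FreqTrigger(freq_steps=3)  # analogous to save_freq_steps=3
fired_at = [step for step in range(1, 10) if save_ctl.check(steps=1)]
print(fired_at)  # steps at which a save would be triggered
```

Leaving all three frequencies at None means the trigger never fires, which mirrors the "If None, the model will not be saved" wording of the parameters.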

class realhf.CommonExperimentConfig(experiment_name: str = '???', trial_name: str = '???', mode: str = 'slurm', debug: bool = True, partition: str = 'dev', wandb_mode: str = 'disabled', image_name: str | None = None, recover_mode: str = 'disabled', recover_retries: int = 1, ignore_worker_error: bool = False, allocation_mode: str = 'pipe_model', allocation_use_cache: bool = False, n_nodes: int = 1, n_gpus_per_node: int = 8, nodelist: str | None = None, seed: int = 1, cache_clear_freq: int | None = 10, exp_ctrl: ~realhf.api.core.system_api.ExperimentSaveEvalControl = <factory>)

Configuration for quickstart experiments.

All members can be modified via the command line. For example,

$ python3 -m realhf.apps.quickstart sft trial_name=my_trial seed=42 exp_ctrl.save_freq_steps=10 ...

This command changes the trial_name, seed, and the save_freq_steps attribute of the exp_ctrl attribute in this class.
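
To make the effect of such dotted overrides concrete, here is a small sketch using plain dataclasses. It does not use Hydra and is not how ReaL parses arguments; the stand-in classes and the apply_override helper are hypothetical and only mimic the outcome of the command above.

```python
from dataclasses import dataclass, field
from typing import Any, Optional


@dataclass
class SaveEvalControl:  # stand-in for ExperimentSaveEvalControl
    total_train_epochs: int = 1
    save_freq_steps: Optional[int] = None


@dataclass
class ExperimentConfig:  # stand-in for CommonExperimentConfig
    trial_name: str = "???"
    seed: int = 1
    exp_ctrl: SaveEvalControl = field(default_factory=SaveEvalControl)


def apply_override(cfg: Any, dotted_key: str, value: Any) -> None:
    """Walk the dotted path and set the leaf attribute."""
    *parents, leaf = dotted_key.split(".")
    target = cfg
    for name in parents:
        target = getattr(target, name)
    if not hasattr(target, leaf):
        raise AttributeError(f"unknown option: {dotted_key}")
    setattr(target, leaf, value)


cfg = ExperimentConfig()
overrides = [
    ("trial_name", "my_trial"),
    ("seed", 42),
    ("exp_ctrl.save_freq_steps", 10),
]
for key, value in overrides:
    apply_override(cfg, key, value)
print(cfg)
```

Options that are not overridden, such as exp_ctrl.total_train_epochs here, keep their defaults.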

recover_mode can be one of the following:

  • auto: Automatically recover the last failed run.

  • save: Save recovery states if an error occurs.

  • resume: Resume from saved recovery states and save states if a failure occurs again.

  • disabled: Do nothing but raise an error if one occurs.

If you are not familiar with ReaL’s recovery mechanism, set this to disabled. Normal checkpointing is usually sufficient in most cases.

allocation_mode can be one of the following:

  • manual: Manually allocate resources using the specified command-line options.

  • search: Allocate resources and configure parallel strategies using the search engine.

  • heuristic: Allocate resources and configure parallel strategies using heuristic strategies obtained from a search.

  • pipe_data: Identical parallelization (like DSChat) with pipe+data parallelism. For a world size under 8, only data parallelism will be used.

  • pipe_model: Identical parallelization (like DSChat) with pipe+model parallelism. For a world size under 8, only tensor-model parallelism will be used.

  • A regex pattern like d${DP}p${PP}m${TP}: Identical parallelization for all MFCs with ${DP}-way data parallelism, ${PP}-way pipeline parallelism, and ${TP}-way model parallelism.
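
The pattern form in the last bullet can be decoded with a short regular expression. The sketch below is an illustration only: parse_allocation_mode is a hypothetical helper, it accepts only the d/p/m order shown above, and the check that the three degrees multiply to the world size is an assumption about what a valid identical parallelization requires.

```python
import re
from typing import Dict

# Assumed form: "d", "p", "m", each followed by a positive integer.
_PATTERN = re.compile(r"d(\d+)p(\d+)m(\d+)")


def parse_allocation_mode(mode: str, world_size: int) -> Dict[str, int]:
    """Decode e.g. "d2p2m2" into data/pipeline/model parallel degrees."""
    match = _PATTERN.fullmatch(mode)
    if match is None:
        raise ValueError(f"not a d<DP>p<PP>m<TP> pattern: {mode!r}")
    dp, pp, tp = (int(group) for group in match.groups())
    if dp * pp * tp != world_size:
        raise ValueError(
            f"{dp} * {pp} * {tp} does not equal the world size {world_size}"
        )
    return {"data": dp, "pipeline": pp, "model": tp}


strategy = parse_allocation_mode("d2p2m2", world_size=8)
print(strategy)
```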

Parameters:
  • experiment_name (str) – The name of the experiment. An arbitrary string without “_” and “/”, e.g., ultra-chat-llama. This parameter is required.

  • trial_name (str) – The name of the trial. An arbitrary string without “_” and “/”, e.g., lr1e-3wd0.05. This parameter is required.

  • mode (str) – The experiment launching mode. Supported values are “local”, “ray”, or “slurm”. “ray” mode requires launching the Ray cluster via CLI. “slurm” mode requires the Pyxis plugin with the Enroot container enabled. “local” mode implies n_nodes=1.

  • debug (bool) – Whether to run in debug mode. Setting this to False will disable all assertions, which will be faster but less safe.

  • partition (str) – The SLURM partition for running the experiment.

  • wandb_mode (str) – The mode for WandB. Currently, WandB logging is not supported.

  • image_name (str or None) – The name of the Docker image used by the controller. This parameter is only used in SLURM mode.

  • recover_mode (str) – The recovery mode. See above for details.

  • recover_retries (int) – The number of retries for recovery. Effective only when recover_mode is set to “auto”.

  • ignore_worker_error (bool) – Whether to ignore errors raised by workers during runtime. Only set this to True if you are certain that the error can be ignored. Effective only when recover_mode is set to “disabled”.

  • allocation_mode (str) – The mode for GPU parallel strategy allocation. See above for details.

  • allocation_use_cache (bool) – Whether to use cache in allocation search. Effective only when allocation_mode is set to “search” and a cache is available in the log directory of the current experiment name and trial.

  • n_nodes (int) – The number of nodes to run the experiment.

  • n_gpus_per_node (int) – The number of GPUs per node. Thus, the total number of GPUs will be n_nodes * n_gpus_per_node. ReaL supports a world size of 1, 2, 4, 8, … within a single node, or multiple nodes with the same number of GPUs.

  • nodelist (str or None) – Nodelist for the distributed setting in SLURM nodelist format. Required for the manual allocation mode. For multiple GPUs on a single node, it should be formatted as “NODE01:0,1,2,3”, indicating the use of the first 4 GPUs on NODE01. For multiple complete nodes, it should be formatted as “NODE[01-02,03,07],COM08”, indicating the use of all GPUs on these nodes: [NODE01, NODE02, NODE03, NODE07, COM08].

  • seed (int) – The random seed.

  • cache_clear_freq (int or None) – The data transfer cache is cleared every cache_clear_freq steps. If None, the cache is never cleared. Set this to a small number, e.g., 1, if host OOM or CUDA OOM occurs.

  • exp_ctrl (ExperimentSaveEvalControl) – The control for saving and evaluating the experiment.
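
The multi-node nodelist format described for the nodelist parameter can be expanded into individual host names as sketched below. The expand_nodelist helper is hypothetical and covers only the bracketed multi-node form from the example, not the single-node “NODE01:0,1,2,3” form or the full SLURM hostlist grammar.

```python
import re
from typing import List


def expand_nodelist(nodelist: str) -> List[str]:
    """Expand e.g. "NODE[01-02,03,07],COM08" into individual host names."""
    hosts: List[str] = []
    # Split on commas that are not inside square brackets.
    for part in re.split(r",(?![^\[]*\])", nodelist):
        match = re.fullmatch(r"([^\[\]]+)\[([^\[\]]+)\]", part)
        if match is None:
            hosts.append(part)  # a plain host name such as "COM08"
            continue
        prefix, ranges = match.groups()
        for item in ranges.split(","):
            start, _, end = item.partition("-")
            end = end or start  # a single index such as "03"
            width = len(start)  # preserve zero padding, e.g. "01"
            for idx in range(int(start), int(end) + 1):
                hosts.append(f"{prefix}{idx:0{width}d}")
    return hosts


hosts = expand_nodelist("NODE[01-02,03,07],COM08")
print(hosts)
n_gpus_per_node = 8
print(len(hosts) * n_gpus_per_node)  # total GPUs = n_nodes * n_gpus_per_node
```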

class realhf.SFTConfig(experiment_name: str = '???', trial_name: str = '???', mode: str = 'slurm', debug: bool = True, partition: str = 'dev', wandb_mode: str = 'disabled', image_name: str | None = None, recover_mode: str = 'disabled', recover_retries: int = 1, ignore_worker_error: bool = False, allocation_mode: str = 'pipe_model', allocation_use_cache: bool = False, n_nodes: int = 1, n_gpus_per_node: int = 8, nodelist: str | None = None, seed: int = 1, cache_clear_freq: int | None = 10, exp_ctrl: ~realhf.api.core.system_api.ExperimentSaveEvalControl = <factory>, model: ~realhf.api.quickstart.model.ModelTrainEvalConfig = <factory>, allocation: ~realhf.api.quickstart.device_mesh.MFCConfig = <factory>, dataset: ~realhf.api.quickstart.dataset.PromptAnswerDatasetConfig = <factory>)

Configuration for SFT experiments.

This class is a subclass of CommonExperimentConfig, so all CLI options from the base class are available.

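Parameters:
  • model (ModelTrainEvalConfig) – Configuration for model runtime.

  • allocation (MFCConfig) – Configuration for device allocation and parallelism.

  • dataset (PromptAnswerDatasetConfig) – Configuration for the dataset.
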
class realhf.RWConfig(experiment_name: str = '???', trial_name: str = '???', mode: str = 'slurm', debug: bool = True, partition: str = 'dev', wandb_mode: str = 'disabled', image_name: str | None = None, recover_mode: str = 'disabled', recover_retries: int = 1, ignore_worker_error: bool = False, allocation_mode: str = 'pipe_model', allocation_use_cache: bool = False, n_nodes: int = 1, n_gpus_per_node: int = 8, nodelist: str | None = None, seed: int = 1, cache_clear_freq: int | None = 10, exp_ctrl: ~realhf.api.core.system_api.ExperimentSaveEvalControl = <factory>, is_sft_lora: bool = False, sft_lora_path: str | None = None, model: ~realhf.api.quickstart.model.ModelTrainEvalConfig = <factory>, allocation: ~realhf.api.quickstart.device_mesh.MFCConfig = <factory>, dataset: ~realhf.api.quickstart.dataset.PairedComparisonDatasetConfig = <factory>)

Configuration for pairwise reward modeling experiments.

This class is a subclass of CommonExperimentConfig, so all CLI options from the base class are available.

Parameters:
  • is_sft_lora (bool) – Whether LoRA was used for SFT. If LoRA was used, the saved SFT model should only contain LoRA parameters. Since LoRA is currently not supported for SFT, this option is not utilized at present.

  • sft_lora_path (str or None) – Path to the LoRA model for SFT. Since LoRA is currently not supported for SFT, this option is not utilized at present.

  • model (ModelTrainEvalConfig) – Configuration for model runtime.

  • allocation (MFCConfig) – Configuration for device allocation and parallelism.

  • dataset (PairedComparisonDatasetConfig) – Configuration for the dataset.

class realhf.DPOConfig(experiment_name: str = '???', trial_name: str = '???', mode: str = 'slurm', debug: bool = True, partition: str = 'dev', wandb_mode: str = 'disabled', image_name: str | None = None, recover_mode: str = 'disabled', recover_retries: int = 1, ignore_worker_error: bool = False, allocation_mode: str = 'pipe_model', allocation_use_cache: bool = False, n_nodes: int = 1, n_gpus_per_node: int = 8, nodelist: str | None = None, seed: int = 1, cache_clear_freq: int | None = 10, exp_ctrl: ~realhf.api.core.system_api.ExperimentSaveEvalControl = <factory>, is_sft_lora: bool = False, sft_lora_path: str | None = None, actor: ~realhf.api.quickstart.model.ModelTrainEvalConfig = <factory>, ref: ~realhf.api.quickstart.model.ModelTrainEvalConfig = <factory>, actor_train: ~realhf.api.quickstart.device_mesh.MFCConfig = <factory>, ref_inf: ~realhf.api.quickstart.device_mesh.MFCConfig = <factory>, dataset: ~realhf.api.quickstart.dataset.PairedComparisonDatasetConfig = <factory>, beta: float = 0.1)

Configuration for Direct Preference Optimization (DPO) experiments.

This class is a subclass of CommonExperimentConfig, so all CLI options from the base class are available.

Note that runtime evaluation is not implemented for DPO.

Parameters:
  • is_sft_lora (bool) – Whether LoRA was used for SFT. If LoRA was used, the saved SFT model should only contain LoRA parameters. Since LoRA is currently not supported for SFT, this option is not utilized at present.

  • sft_lora_path (str or None) – Path to the LoRA model for SFT. Since LoRA is currently not supported for SFT, this option is not utilized at present.

  • actor (ModelTrainEvalConfig) – Runtime configuration for the primary LLM.

  • ref (ModelTrainEvalConfig) – Runtime configuration for the reference LLM. This model is used only for inference to provide KL regularization. In ReaL, this model is automatically offloaded to CPU, making DPO training as efficient as training a single LLM.

  • actor_train (MFCConfig) – Device allocation and parallelism configuration for training on the primary LLM.

  • ref_inf (MFCConfig) – Device allocation and parallelism configuration for inference on the reference LLM. This configuration can differ from the training allocation. A larger data parallel degree with additional pipelining can improve inference performance.

  • dataset (PairedComparisonDatasetConfig) – Configuration for the dataset, which is the same as for reward modeling.

  • beta (float) – KL regularization coefficient.

class realhf.GenerationHyperparameters(max_new_tokens: int = 256, min_new_tokens: int = 256, greedy: bool = False, top_p: float = 0.9, top_k: int = 200, temperature: float = 1.0, use_cuda_graph: bool = False, force_cudagraph_recapture: bool = True, force_no_logits_mask: bool = False)

Generation hyperparameters.

We implement a customized generation function instead of using HuggingFace’s to support pipelined generation. As a result, advanced generation techniques like diversity-promoting sampling or repetition penalty are not supported during PPO training. However, we do not find this to be a problem in practice. Increasing the sampling temperature and enabling top-k/top-p sampling can produce effective models.

Parameters:
  • max_new_tokens (int) – The maximum number of new tokens to generate.

  • min_new_tokens (int) – The minimum number of new tokens to generate.

  • greedy (bool) – Whether to use greedy decoding.

  • top_k (int) – The number of highest probability tokens to keep.

  • top_p (float) – The cumulative probability of the highest probability tokens to keep.

  • temperature (float) – The temperature of the sampling process.

  • use_cuda_graph (bool) – Whether to use CUDA graph to reduce kernel launch overhead during generation.

  • force_cudagraph_recapture (bool) – Whether to recapture the CUDA graph every time generate is called, even if the graph has been captured before. This introduces minor overhead but releases the KV cache when generation is not running.

  • force_no_logits_mask (bool) – Whether to omit the logits mask. The logits mask is produced when using top-k or top-p sampling, marking tokens that are filtered out. This mask is used by the reference model and the actor model during training to align inferred logits with those during generation and produce accurate KLs. Using the logits mask with top-k/top-p sampling greatly improves the stability of PPO training by narrowing the action space. However, this benefit comes at the cost of additional GPU memory usage. If this option is set to True, the logits mask will be omitted to save GPU memory, which may lead to a decrease in learning performance.
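
To illustrate what the logits mask records, the following sketch computes a top-k/top-p filter over a single logit vector in pure Python. It is not ReaL's generation code: filter_mask and apply_mask are hypothetical helpers, and the convention that True marks a filtered-out token follows the description above.

```python
import math
from typing import List


def filter_mask(logits: List[float], top_k: int, top_p: float) -> List[bool]:
    """Return a mask where True marks tokens that are filtered OUT."""
    order = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)
    # Numerically stable softmax.
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]

    keep = set()
    cumulative = 0.0
    for rank, idx in enumerate(order):
        if rank >= top_k:
            break  # top-k: keep at most top_k tokens
        keep.add(idx)
        cumulative += probs[idx]
        if cumulative >= top_p:
            break  # top-p: stop once the kept mass reaches top_p
    return [i not in keep for i in range(len(logits))]


def apply_mask(logits: List[float], mask: List[bool]) -> List[float]:
    """Set filtered-out logits to -inf so they get zero probability."""
    return [-math.inf if masked else x for x, masked in zip(logits, mask)]


logits = [2.0, 1.0, 0.0, -1.0]
mask = filter_mask(logits, top_k=2, top_p=0.9)
print(mask)
print(apply_mask(logits, mask))
```

Applying the same mask to the logits recomputed during training restricts them to the tokens that were actually eligible during sampling, which is the alignment the parameter description refers to.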

class realhf.PPOHyperparameters(gen: ~realhf.api.core.model_api.GenerationHyperparameters = <factory>, ppo_n_minibatches: int = 4, kl_ctl: float = 0.1, discount: float = 1.0, gae_lambda: float = 1.0, eps_clip: float = 0.2, value_eps_clip: float = 0.2, max_reward_clip: float = 20.0, reward_output_scaling: float = 1.0, reward_output_bias: float = 0.0, early_stop_imp_ratio: float = 5.0, use_adaptive_kl_ctl: bool = False, adv_norm: bool = True, value_norm: bool = True, value_norm_type: str = 'exp', value_norm_beta: float = 0.99995, value_norm_eps: float = 1e-05)

Configuration for PPO hyperparameters.

Parameters:
  • gen (GenerationHyperparameters) – Hyperparameters for generation.

  • ppo_n_minibatches (int) – Number of minibatches in each PPO update.

  • kl_ctl (float) – Coefficient for KL divergence rewards.

  • discount (float) – Discount factor for future rewards.

  • gae_lambda (float) – Lambda factor used in Generalized Advantage Estimation (GAE).

  • eps_clip (float) – Clipping factor for the PPO actor probability ratio.

  • value_eps_clip (float) – Clipping factor for the PPO value function.

  • max_reward_clip (float) – Maximum reward value after clipping.

  • reward_output_scaling (float) – Scaling factor for the reward model output.

  • reward_output_bias (float) – Bias for the reward model output. The output of the reward model will be clipped to the range [-max_reward_clip, max_reward_clip] after applying the scaling and bias: CLIP((x - bias) * scaling, -max_reward_clip, max_reward_clip).

  • early_stop_imp_ratio (float) – Maximum value of the importance ratio. PPO updates will be early stopped if the ratio exceeds this value.

  • use_adaptive_kl_ctl (bool) – Whether to use an adaptive KL divergence coefficient.

  • adv_norm (bool) – Whether to normalize the advantage estimates.

  • value_norm (bool) – Whether to denormalize values and normalize return predictions.

  • value_norm_type (str) – Type of value normalization. Can be either “exp” for exponential moving average or “ma” for moving average.

  • value_norm_beta (float) – Exponential decay factor for the exponential moving average.

  • value_norm_eps (float) – Epsilon factor in the denominator of the exponential moving average.
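
The reward transformation given for reward_output_bias, CLIP((x - bias) * scaling, -max_reward_clip, max_reward_clip), can be written as a small function. The function name clip_reward is hypothetical; the defaults mirror those in the class signature.

```python
def clip_reward(
    x: float,
    bias: float = 0.0,
    scaling: float = 1.0,
    max_reward_clip: float = 20.0,
) -> float:
    """Shift, scale, then clip a raw reward model output."""
    transformed = (x - bias) * scaling
    return max(-max_reward_clip, min(max_reward_clip, transformed))


print(clip_reward(3.0, bias=1.0, scaling=2.0))
print(clip_reward(100.0))
print(clip_reward(-100.0))
```

Note that the bias is subtracted before scaling, so changing reward_output_scaling also rescales the effect of the bias.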

class realhf.PPOConfig(experiment_name: str = '???', trial_name: str = '???', mode: str = 'slurm', debug: bool = True, partition: str = 'dev', wandb_mode: str = 'disabled', image_name: str | None = None, recover_mode: str = 'disabled', recover_retries: int = 1, ignore_worker_error: bool = False, allocation_mode: str = 'pipe_model', allocation_use_cache: bool = False, n_nodes: int = 1, n_gpus_per_node: int = 8, nodelist: str | None = None, seed: int = 1, cache_clear_freq: int | None = 10, exp_ctrl: ~realhf.api.core.system_api.ExperimentSaveEvalControl = <factory>, is_sft_lora: bool = False, sft_lora_path: str | None = None, is_rew_lora: bool = False, rew_lora_path: str | None = None, rew_head_path: str | None = None, actor: ~realhf.api.quickstart.model.ModelTrainEvalConfig = <factory>, critic: ~realhf.api.quickstart.model.ModelTrainEvalConfig = <factory>, ref: ~realhf.api.quickstart.model.ModelTrainEvalConfig = <factory>, rew: ~realhf.api.quickstart.model.ModelTrainEvalConfig = <factory>, actor_train: ~realhf.api.quickstart.device_mesh.MFCConfig = <factory>, critic_train: ~realhf.api.quickstart.device_mesh.MFCConfig = <factory>, actor_gen: ~realhf.api.quickstart.device_mesh.MFCConfig = <factory>, critic_inf: ~realhf.api.quickstart.device_mesh.MFCConfig = <factory>, rew_inf: ~realhf.api.quickstart.device_mesh.MFCConfig = <factory>, ref_inf: ~realhf.api.quickstart.device_mesh.MFCConfig = <factory>, dataset: ~realhf.api.quickstart.dataset.PromptOnlyDatasetConfig = <factory>, ppo: ~realhf.experiments.common.ppo_exp.PPOHyperparameters = <factory>)

Configuration for PPO experiments.

This class is a subclass of CommonExperimentConfig, so all CLI options from the base class are available.

Note that runtime evaluation is not implemented for PPO.

The RLHF process involves four distinct models with independent parameters and six model function calls.

The four models are:

  • Actor: The primary LLM that generates text.

  • Critic: The value function that estimates the value of a state.

  • Ref: The reference LLM that provides KL regularization.

  • Rew: The reward model that provides reward signals.

The six model function calls and their dependencies are:

  • Rollout: Generate text from the actor model.

  • InfReward: Infer rewards from the reward model based on generated text.

  • InfRef: Infer log probabilities from the reference model based on generated text.

  • InfValues: Infer values from the critic model based on generated text.

  • TrainActor: Train the actor model using generated text, rewards, values, and reference log probabilities.

  • TrainCritic: Train the critic model using generated text, rewards, values, and reference log probabilities.

This class manages these dependencies internally. Users should specify the runtime configurations of the models and the allocations for each model function call.
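
The dependencies among the six model function calls can be pictured as a small graph. The sketch below encodes the data dependencies listed above and groups the calls into stages with the standard library's graphlib (Python 3.9+). It only illustrates the ordering constraints; it is not how ReaL schedules or overlaps the calls.

```python
from graphlib import TopologicalSorter

# Each function call maps to the calls whose outputs it consumes.
dependencies = {
    "Rollout": set(),
    "InfReward": {"Rollout"},
    "InfRef": {"Rollout"},
    "InfValues": {"Rollout"},
    "TrainActor": {"Rollout", "InfReward", "InfRef", "InfValues"},
    "TrainCritic": {"Rollout", "InfReward", "InfRef", "InfValues"},
}

sorter = TopologicalSorter(dependencies)
sorter.prepare()
stages = []
while sorter.is_active():
    # Calls returned together have no dependencies on each other.
    ready = sorted(sorter.get_ready())
    stages.append(ready)
    sorter.done(*ready)

for index, stage in enumerate(stages):
    print(index, stage)
```

The three inference calls form one stage and the two training calls another, which is why each call gets its own MFCConfig in the parameters of this class.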

Parameters:
  • is_sft_lora (bool) – Whether LoRA was used for SFT. If LoRA was used, the saved SFT model should only contain LoRA parameters. Since LoRA is currently not supported for SFT, this option is not utilized at present.

  • sft_lora_path (str or None) – Path to the LoRA model for SFT. Since LoRA is currently not supported for SFT, this option is not utilized at present.

  • is_rew_lora (bool) – Whether LoRA was used for reward modeling. If LoRA was used, the saved reward model should only contain LoRA parameters and the new reward head. Since LoRA is currently not supported for reward modeling, this option is not utilized at present.

  • rew_lora_path (str or None) – Path to the LoRA model for reward modeling. Since LoRA is currently not supported for reward modeling, this option is not utilized at present.

  • rew_head_path (str or None) – Path to the new reward head for reward modeling. Since LoRA is currently not supported for reward modeling, this option is not utilized at present.

  • actor (ModelTrainEvalConfig) – Runtime configuration for the primary LLM.

  • critic (ModelTrainEvalConfig) – Runtime configuration for the critic model of PPO.

  • ref (ModelTrainEvalConfig) – Runtime configuration for the reference LLM.

  • rew (ModelTrainEvalConfig) – Runtime configuration for the reward LLM.

  • actor_train (MFCConfig) – MFCConfig for the TrainActor function call.

  • critic_train (MFCConfig) – MFCConfig for the TrainCritic function call.

  • actor_gen (MFCConfig) – MFCConfig for the Rollout function call.

  • critic_inf (MFCConfig) – MFCConfig for the InfValues function call.

  • rew_inf (MFCConfig) – MFCConfig for the InfReward function call.

  • ref_inf (MFCConfig) – MFCConfig for the InfRef function call.

  • dataset (PromptOnlyDatasetConfig) – Configuration for the dataset.

  • ppo (PPOHyperparameters) – Configuration for the PPO algorithm.

class realhf.GenerationConfig(experiment_name: str = '???', trial_name: str = '???', mode: str = 'slurm', debug: bool = True, partition: str = 'dev', wandb_mode: str = 'disabled', image_name: str | None = None, recover_mode: str = 'disabled', recover_retries: int = 1, ignore_worker_error: bool = False, allocation_mode: str = 'pipe_model', allocation_use_cache: bool = False, n_nodes: int = 1, n_gpus_per_node: int = 8, nodelist: str | None = None, seed: int = 1, cache_clear_freq: int | None = 10, exp_ctrl: ~realhf.api.core.system_api.ExperimentSaveEvalControl = <factory>, model: ~realhf.api.quickstart.model.ModelTrainEvalConfig = <factory>, gen: ~realhf.api.core.model_api.GenerationHyperparameters = <factory>, dataset: ~realhf.api.quickstart.dataset.PromptOnlyDatasetConfig = <factory>, allocation: ~realhf.api.quickstart.device_mesh.MFCConfig = <factory>, output_file: str = 'output.jsonl')

Configuration for generation experiments.

This class is a subclass of CommonExperimentConfig, so all CLI options from the base class are available.

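Parameters:
  • model (ModelTrainEvalConfig) – Runtime configuration for the model used for generation.

  • gen (GenerationHyperparameters) – Hyperparameters for generation.

  • dataset (PromptOnlyDatasetConfig) – Configuration for the dataset.

  • allocation (MFCConfig) – Configuration for device allocation and parallelism.

  • output_file (str) – Name of the file that receives the generated outputs.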

Model Configurations

class realhf.ModelFamily(_class: str, size: int = 0, is_critic: bool = False)

An identifier for the HF model type, such as llama, gpt2, etc.

Parameters:
  • _class (str) – The class of the model, e.g., “llama”. This is the registered name in the register_hf_family function. Please refer to the files in realhf/api/from_hf for a list of all supported models.

  • size (int) – The size of the model. This parameter is only used by the search allocation mode and will be ignored otherwise.

  • is_critic (bool) – Indicates whether the model is a critic or reward model, as opposed to a standard LLM.

class realhf.ModelTrainEvalConfig(type: ~realhf.api.core.config.ModelFamily = llama-7, backend: str = 'megatron', path: str = '', lora: ~realhf.api.quickstart.model.LoRAConfig | None = None, gradient_checkpointing: bool = True, enable_fp16: bool = True, enable_bf16: bool = False, offload: bool = False, zero_stage: int = 2, optimizer: ~realhf.api.quickstart.model.OptimizerConfig | None = <factory>, init_critic_from_actor: bool = False)

Runtime configuration for models (or LLMs) in ReaL.

We use a customized model class instead of HuggingFace’s. This customized model has the following highlights:

  1. Support for 3D parallelism and sequence parallelism.

  2. Support for flash attention during both training and generation.

  3. Input sequences are packed into a single 1D tensor to save GPU memory and improve efficiency.

Consequently, each HuggingFace model of interest needs to be manually converted to this customized model. Implemented models can be found in the realhf/api/from_hf/ directory.

Parameters:
  • type (ModelFamily) – Model family type, e.g., llama, qwen2, etc.

  • backend (str) – Backend for training. Currently, only “megatron” and “deepspeed” are supported. Use “deepspeed” for offloading parameters or optimizer states, and “megatron” for parameter reallocation.

  • path (str) – Path of the HuggingFace checkpoint.

  • lora (Optional[LoRAConfig]) – Whether to use LoRA (Low-Rank Adaptation).

  • gradient_checkpointing (bool) – Whether to use gradient checkpointing to save memory.

  • enable_fp16 (bool) – Whether to use fp16 precision.

  • enable_bf16 (bool) – Whether to use bf16 precision. Mutually exclusive with fp16.

  • offload (bool) – Whether to offload model parameters to CPU. Only valid for the DeepSpeed backend.

  • parallel (ParallelismConfig) – Configuration for parallelism.

  • zero_stage (int) – Stage of ZeRO optimization. Should be one of 0, 1, 2, or 3.

  • optimizer (Optional[OptimizerConfig]) – Configuration for the optimizer.

  • init_critic_from_actor (bool) – Whether to initialize a critic/reward model from a saved LM checkpoint.
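
Several of the constraints stated in the parameter descriptions above can be checked before launching an experiment. The sketch below is a hypothetical pre-flight check: ModelRuntimeOptions is a stand-in that copies only a few fields, and validate is not part of ReaL.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ModelRuntimeOptions:  # hypothetical stand-in for ModelTrainEvalConfig
    backend: str = "megatron"
    enable_fp16: bool = True
    enable_bf16: bool = False
    offload: bool = False
    zero_stage: int = 2


def validate(opts: ModelRuntimeOptions) -> List[str]:
    """Return a list of violated constraints (empty if none)."""
    problems: List[str] = []
    if opts.backend not in ("megatron", "deepspeed"):
        problems.append(f"unsupported backend: {opts.backend!r}")
    if opts.enable_fp16 and opts.enable_bf16:
        problems.append("enable_fp16 and enable_bf16 are mutually exclusive")
    if opts.offload and opts.backend != "deepspeed":
        problems.append("offload is only valid for the deepspeed backend")
    if opts.zero_stage not in (0, 1, 2, 3):
        problems.append("zero_stage should be one of 0, 1, 2, or 3")
    return problems


print(validate(ModelRuntimeOptions()))
print(validate(ModelRuntimeOptions(enable_bf16=True, offload=True, zero_stage=4)))
```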

class realhf.OptimizerConfig(type: str = 'empty', lr: float = 1e-05, weight_decay: float = 0.05, beta1: float = 0.9, beta2: float = 0.95, eps: float = 1e-05, min_lr_ratio: float = 0.0, lr_scheduler_type: str = 'cosine', warmup_steps_proportion: float = 0.02, offload: bool = False)

Configuration for the optimizer.

For models that will not be trained, the optimizer type should be set to “empty”.

Parameters:
  • type (str) – Type of optimizer. Currently, only “adam” and “empty” optimizers are supported.

  • lr (float) – Learning rate.

  • weight_decay (float) – Weight decay.

  • beta1 (float) – Adam beta1 parameter.

  • beta2 (float) – Adam beta2 parameter.

  • eps (float) – Adam epsilon parameter in the denominator.

  • min_lr_ratio (float) – Minimum learning rate ratio after learning rate annealing. Should be in the interval [0.0, 1.0].

  • lr_scheduler_type (str) – Type of learning rate scheduler. One of “linear”, “cosine”, or “constant”.

  • warmup_steps_proportion (float) – Proportion of total training steps allocated for warming up. Should be in the interval [0.0, 1.0].

  • offload (bool) – Whether to offload the optimizer to CPU. Only valid for the DeepSpeed backend.
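
To show how lr, min_lr_ratio, lr_scheduler_type, and warmup_steps_proportion interact, here is a sketch of a common warm-up plus annealing schedule. The exact formula ReaL uses is not stated on this page, so the linear warm-up and the interpolation toward lr * min_lr_ratio below are assumptions, and lr_at is a hypothetical helper.

```python
import math


def lr_at(
    step: int,
    total_steps: int,
    lr: float = 1e-5,
    min_lr_ratio: float = 0.0,
    warmup_steps_proportion: float = 0.02,
    lr_scheduler_type: str = "cosine",
) -> float:
    """Learning rate at a given step under an assumed schedule."""
    warmup_steps = int(total_steps * warmup_steps_proportion)
    if step < warmup_steps:
        return lr * (step + 1) / warmup_steps  # linear warm-up (assumed)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    if lr_scheduler_type == "constant":
        factor = 1.0
    elif lr_scheduler_type == "linear":
        factor = 1.0 - progress
    elif lr_scheduler_type == "cosine":
        factor = 0.5 * (1.0 + math.cos(math.pi * progress))
    else:
        raise ValueError(f"unknown scheduler: {lr_scheduler_type!r}")
    # Anneal from lr down to lr * min_lr_ratio.
    return lr * (min_lr_ratio + (1.0 - min_lr_ratio) * factor)


for step in (0, 2, 51, 100):
    print(step, lr_at(step, total_steps=100, min_lr_ratio=0.1))
```

With min_lr_ratio=0.0 the rate decays all the way to zero under this assumed schedule; a value of 1.0 would make every scheduler type behave like “constant” after warm-up.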

class realhf.ParallelismConfig(model_parallel_size: int = 1, pipeline_parallel_size: int = 1, data_parallel_size: int = 1, use_sequence_parallel: bool = False)

Configuration for 3D parallelism.

Parameters:
  • model_parallel_size (int) – Size of tensor-model parallelism.

  • pipeline_parallel_size (int) – Number of pipeline parallelism stages.

  • data_parallel_size (int) – Data parallelism size for ZeRO optimization.

  • use_sequence_parallel (bool) – Whether to use sequence parallelism in Megatron in combination with tensor-model parallelism.

class realhf.MFCConfig(n_mbs: int | None = None, parallel: ~realhf.api.quickstart.model.ParallelismConfig = <factory>, device_mesh: str | None = None)

Configuration for a single MFC.

Parameters:
  • n_mbs (Optional[int]) – Number of micro-batches when executing this MFC. Refer to MFCDef for details.

  • parallel (ParallelismConfig) – Configuration for the parallelism strategy. This is used only for manual allocation.

  • device_mesh (Optional[str]) – String representation of the device mesh. If it consists of multiple nodes, it should be formatted as a SLURM nodelist, e.g., node[01-02] or node01,node02. If it represents a slice on a single node, it should occupy 1, 2, 4, or 8 contiguous GPUs on the node. In this case, the string representation is similar to an MPI hostfile, e.g., “node01:0,1,2,3” for the first 4 GPUs on node01. This is used only for manual allocation.
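
The single-node device_mesh form can be validated against the rules stated above (1, 2, 4, or 8 contiguous GPUs). The parse_single_node_mesh helper is hypothetical and handles only the “host:gpu,gpu,...” form, not multi-node nodelists.

```python
from typing import List, Tuple


def parse_single_node_mesh(device_mesh: str) -> Tuple[str, List[int]]:
    """Parse e.g. "node01:0,1,2,3" into ("node01", [0, 1, 2, 3])."""
    host, sep, gpu_part = device_mesh.partition(":")
    if not sep:
        raise ValueError("expected the form 'host:gpu,gpu,...'")
    gpus = [int(g) for g in gpu_part.split(",")]
    if len(gpus) not in (1, 2, 4, 8):
        raise ValueError("a slice must hold 1, 2, 4, or 8 GPUs")
    if gpus != list(range(gpus[0], gpus[0] + len(gpus))):
        raise ValueError("GPU indices must be contiguous and ascending")
    return host, gpus


print(parse_single_node_mesh("node01:0,1,2,3"))
print(parse_single_node_mesh("node01:4,5"))
```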

class realhf.ReaLModelConfig(n_layers: int, n_kv_heads: int, n_q_heads: int, hidden_dim: int, intermediate_dim: int, vocab_size: int, head_dim: int | None = None, n_positions: int | None = None, embd_pdrop: float = 0.1, resid_pdrop: float = 0.1, attn_pdrop: float = 0.1, layer_norm_epsilon: float = 1e-05, activation_function: str = 'gelu', scale_attn_by_inverse_layer_idx: bool = True, scale_attn_weights: bool = True, use_attention_bias: bool = True, use_attn_proj_bias: bool = True, layer_norm_type: str | None = None, mlp_type: str | None = None, apply_rotary: bool = False, rotary_base: float = 10000.0, rotary_interleaved: bool = False, rotary_scaling: float | None = None, rotary_scaling_type: str | None = None, normalize_embed: bool = False, abs_position_embedding_offset: int = 0, do_layernorm_before: bool = True, tied_embedding: bool = False, sliding_window: int | None = None, moe: ReaLMoEConfig | None = None, is_critic: bool = False)

Configuration for the ReaLModel.

Parameters:
  • n_layers (int) – The number of transformer blocks.

  • n_kv_heads (int) – The number of key-value attention heads.

  • n_q_heads (int) – The number of query attention heads.

  • head_dim (int or None) – The dimension of each attention head. If None, it defaults to hidden_dim // n_q_heads. If specified, the query layer will have the shape (hidden_dim, head_dim * n_q_heads).

  • hidden_dim (int) – The hidden dimension of the transformer block.

  • intermediate_dim (int) – The dimension of the intermediate layer in the MLP.

  • vocab_size (int) – The vocabulary size.

  • n_positions (Optional[int]) – The maximum context length. Can be None for rotary embedding, where the context length is determined during runtime.

  • embd_pdrop (float) – The dropout probability for the embedding layer.

  • resid_pdrop (float) – The dropout probability for the residual connections.

  • attn_pdrop (float) – The dropout probability for the attention weights.

  • layer_norm_epsilon (float) – The epsilon value for layer normalization.

  • activation_function (str) – The activation function for the MLP.

  • scale_attn_by_inverse_layer_idx (bool) – Whether to scale the attention weights by the inverse of the layer index.

  • use_attention_bias (bool) – Whether to use bias for QKV layers.

  • use_attn_proj_bias (bool) – Whether to use bias for the attention projection layer.

  • layer_norm_type (Optional[str]) – The type of layer normalization. Can be None, “rms”, or “gemma”.

  • mlp_type (Optional[str]) – The type of the MLP. Can be None, “llama”, or “moe”.

  • apply_rotary (bool) – Whether to apply rotary embedding.

  • rotary_base (float) – The exponential base for the rotary embedding.

  • rotary_interleaved (bool) – Whether to use interleaved rotary embedding.

  • rotary_scaling (Optional[float]) – The scaling factor for the rotary embedding.

  • rotary_scaling_type (Optional[str]) – The type of scaling for the rotary embedding.

  • normalize_embed (bool) – Whether to normalize the embeddings before passing them through the transformer blocks. Used by Gemma.

  • abs_position_embedding_offset (int) – The offset for the absolute position embedding. Used by OPT, but OPT is currently not supported.

  • do_layernorm_before (bool) – Whether to apply layer normalization before the attention rather than after. Used by OPT, but OPT is currently not supported.

  • tied_embedding (bool) – Whether to share the embeddings and output weights. Used by models like GPT-2 and Gemma.

  • sliding_window (Optional[int]) – The sliding window size for the attention. Currently a placeholder and not supported.

  • moe (Optional[ReaLMoEConfig]) – Configuration for MoE models, only effective when mlp_type=“moe”.

  • is_critic (bool) – Whether the model is a critic model.
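To give a rough sense of what rotary_base controls, the sketch below computes the per-dimension rotation frequencies from the textbook rotary-embedding formula. The function name, the head_dim value, and the default base of 10000 are illustrative assumptions; ReaL's own implementation, including how rotary_scaling is applied, may differ.

```python
def rotary_inv_freq(head_dim: int, rotary_base: float = 10000.0) -> list:
    """Textbook RoPE inverse frequencies: rotary_base ** (-2i / head_dim).

    Assumption: this is the standard formulation, not necessarily ReaL's code.
    """
    if head_dim % 2 != 0:
        raise ValueError("head_dim must be even")
    return [rotary_base ** (-(2 * i) / head_dim) for i in range(head_dim // 2)]


# One frequency per pair of head dimensions; a larger base yields slower rotations.
freqs = rotary_inv_freq(head_dim=8)
```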

Dataset Configurations

class realhf.PromptAnswerDatasetConfig(train_path: str = '', valid_path: str = '', max_seqlen: int = 1024, train_bs_n_seqs: int = 256, valid_bs_n_seqs: int = 256, pad_to_max_length: bool = False)

Configuration for datasets used in Supervised Fine-Tuning (SFT).

The raw data must be in a JSON or JSONL file format, where each entry is a dictionary with the keys prompt and answer. Both prompt and answer must be strings.

Parameters:
  • train_path (str) – Path to the training dataset.

  • valid_path (str) – Path to the validation dataset.

  • max_seqlen (int) – Maximum sequence length (prompt + answer). Sequences longer than this will be truncated.

  • train_bs_n_seqs (int) – Number of sequences in each batch during training.

  • valid_bs_n_seqs (int) – Number of sequences in each batch during validation.

  • pad_to_max_length (bool) – Whether to pad sequences to the maximum length. If True, all mini-batches created by the DP balanced partitioning algorithm will have the same number of tokens, making MFC time predictable. This option is used only for benchmarking purposes.
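The file format described above can be produced with a few lines of standard-library Python. This is a minimal sketch: the file name, the sample records, and the validate_prompt_answer helper are hypothetical and not part of ReaL.

```python
import json
import os
import tempfile

records = [
    {"prompt": "Translate to French: cat", "answer": "chat"},
    {"prompt": "2 + 2 =", "answer": "4"},
]


def validate_prompt_answer(entry: dict) -> None:
    """Check the documented shape: string values under 'prompt' and 'answer'."""
    for key in ("prompt", "answer"):
        if not isinstance(entry.get(key), str):
            raise ValueError(f"entry needs a string {key!r} field: {entry}")


# Write one JSON object per line (JSONL); the path is a throwaway location.
path = os.path.join(tempfile.mkdtemp(), "sft_train.jsonl")
with open(path, "w", encoding="utf-8") as f:
    for rec in records:
        validate_prompt_answer(rec)
        f.write(json.dumps(rec) + "\n")

# Read the file back to confirm the round trip.
with open(path, encoding="utf-8") as f:
    loaded = [json.loads(line) for line in f]
```

The resulting path is what train_path or valid_path would point to.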

class realhf.PairedComparisonDatasetConfig(train_path: str = '', valid_path: str = '', max_pairs_per_prompt: int = 2, max_seqlen: int = 1024, train_bs_n_seqs: int = 256, valid_bs_n_seqs: int = 256)

Configuration for datasets used in paired-comparison reward modeling, DPO, and SimPO.

The raw data must be in a JSON or JSONL file format, where each entry is a dictionary with the keys prompt, pos_answers, and neg_answers. prompt is a string, while pos_answers and neg_answers are lists of strings. The lists must have the same length.

The raw dataset may contain multiple answer pairs for each prompt. In each epoch, we will randomly sample max_pairs_per_prompt answer pairs for each prompt, so the maximum batch size (in terms of the number of sequences) per step is train_bs_n_seqs multiplied by max_pairs_per_prompt.

Parameters:
  • train_path (str) – Path to the training dataset.

  • valid_path (str) – Path to the evaluation dataset.

  • max_pairs_per_prompt (int) – Maximum number of answer pairs per prompt.

  • max_seqlen (int) – Maximum sequence length (prompt + answers). Sequences longer than this will be truncated.

  • train_bs_n_seqs (int) – Number of sequences in each batch during training.

  • valid_bs_n_seqs (int) – Number of sequences in each batch during validation.
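The per-epoch sampling of answer pairs described above can be sketched as follows. The sketch assumes that pos_answers[i] is paired with neg_answers[i]; this alignment, the sample entry, and the sample_pairs helper are illustrative assumptions rather than ReaL's implementation.

```python
import random

entry = {
    "prompt": "Explain gravity briefly.",
    "pos_answers": ["good A", "good B", "good C"],
    "neg_answers": ["bad A", "bad B", "bad C"],
}


def sample_pairs(entry: dict, max_pairs_per_prompt: int, rng: random.Random):
    """Sample at most max_pairs_per_prompt (pos, neg) pairs for one prompt."""
    pos, neg = entry["pos_answers"], entry["neg_answers"]
    if len(pos) != len(neg):
        raise ValueError("pos_answers and neg_answers must have the same length")
    n = min(max_pairs_per_prompt, len(pos))
    # Assumption: pairs are index-aligned, so we sample indices, not answers.
    chosen = rng.sample(range(len(pos)), n)
    return [(pos[i], neg[i]) for i in chosen]


pairs = sample_pairs(entry, max_pairs_per_prompt=2, rng=random.Random(0))
```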

class realhf.PromptOnlyDatasetConfig(path: str = '', max_prompt_len: int = 256, train_bs_n_seqs: int = 256, pad_to_max_length: bool = False)

Configuration for datasets used in PPO RLHF.

The raw data must be in a JSON or JSONL file format, where each entry is a dictionary with a single key called prompt, which is a string.

Parameters:
  • path (str) – Path to the dataset.

  • max_prompt_len (int) – Maximum length of the prompt. Prompts longer than this will be truncated.

  • train_bs_n_seqs (int) – Number of prompts in each batch.

  • pad_to_max_length (bool) – Whether to pad prompts to the maximum length. If True, all mini-batches created by the DP balanced partitioning algorithm will have the same number of tokens, making MFC time predictable. This option is used only for benchmarking purposes.

Data Structure for Interfaces and Datasets

class realhf.SequenceSample(keys: Set[str], trailing_shapes: Dict[str, torch.Size | Tuple | None], dtypes: Dict[str, torch.dtype | None], ids: List[Hashable], seqlens: Dict[str, List[List[int]]], data: Dict[str, torch.Tensor | None] | None = None, metadata: Dict[str, List[Any]] = <factory>)

The data structure used to represent sequence data.

Each piece of data is assumed to have several “keys” (like a dictionary), with each key potentially corresponding to multiple sequences.

For example, when running PPO, multiple responses can be generated for each prompt. If there are 2 prompts, each with 3 responses, the batch might look like:

>>> s = SequenceSample(...)
>>> s.keys
{'resp', 'prompt'}
>>> s.seqlens
{'prompt': [[13], [6]], 'resp': [[6, 17, 15], [13, 15, 13]]}
>>> s.data
{'prompt': torch.tensor([...]), 'resp': torch.tensor([...])}

Key points:

  • Data with different batch indices can have varying lengths (e.g., the first prompt has a length of 13 while the second has a length of 6).

  • A key (e.g., “resp”) can correspond to multiple sequences with different lengths. Additionally, the number of sequences stored under a key can vary across batch indices. For example, the first prompt may have 2 responses, and the second may have 3.

  • Regardless of the batch size or the number of sequences stored for each key, the data for a key is concatenated into a single packed tensor along the first (sequence) dimension. The nesting is kept only in seqlens, where the outer list indexes the batch and the inner list indexes the sequences for the key.

This data structure facilitates easy gathering, splitting, and transferring of non-padded batches between different GPUs.
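The relationship between the nested seqlens and the packed data can be made concrete with a little offset arithmetic. The sketch below uses plain Python and the lengths from the example above; packed_length and sequence_slices are hypothetical helpers, not SequenceSample methods.

```python
from itertools import accumulate

# Nested lengths from the example above: 2 prompts, 3 responses each.
seqlens = {"prompt": [[13], [6]], "resp": [[6, 17, 15], [13, 15, 13]]}


def packed_length(lens):
    """Length of the first dimension of the packed data for one key."""
    return sum(sum(inner) for inner in lens)


def sequence_slices(lens):
    """Per batch index, the (start, end) offsets of each sequence in the packed data."""
    flat = [n for inner in lens for n in inner]
    ends = list(accumulate(flat))
    starts = [0] + ends[:-1]
    offsets = iter(zip(starts, ends))
    return [[next(offsets) for _ in inner] for inner in lens]


resp_slices = sequence_slices(seqlens["resp"])
# The second response of the first prompt occupies packed positions 6..23.
```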

Parameters:
  • keys (Set[str]) – The keys of the data.

  • trailing_shapes (Dict[str, torch.Size | Tuple | None]) – The trailing shapes of the data, excluding the first dimension, which must be the sequence length. Used to construct the receiving buffer for data transfer.

  • dtypes (Dict[str, torch.dtype | None]) – The types of the data. Used to construct the receiving buffer for data transfer.

  • ids (List[Hashable]) – Unique identifiers for each piece of data. Should be provided in the dataset implementation. Used to append new data to the buffer after a model function call.

  • seqlens (Dict[str, List[List[int]]]) – The sequence lengths of each sequence in the data. For a given key, this should be a list of lists of integers. The outer list represents the batch size, while the inner lists represent the sequence lengths for this key. Python-native lists are used here because (1) pickling torch.Tensor or numpy array is inefficient, and (2) the size of the inner lists can vary across the batch, making 2D arrays impractical.

  • data (Optional[Dict[str, torch.Tensor | None]]) – The actual concatenated data. If this is None, the sample is a metadata-only sample used by the master worker. The specification of the data should be consistent with the seqlens, dtypes, and trailing_shapes.

  • metadata (Dict[str, List[Any]]) – Metadata for the sample. It should be a dictionary of lists, provided in the dataset implementation. Note that adding metadata can slow down data transfer.
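Since the first dimension of each tensor is the total sequence length and trailing_shapes describes the rest, the shape of a receiving buffer follows directly from the metadata. The helper below is a hypothetical sketch of that calculation; it does not cover keys whose trailing shape is None.

```python
def expected_buffer_shape(lens, trailing_shape=()):
    """Shape of the packed tensor for one key: (total sequence length, *trailing_shape)."""
    total_len = sum(sum(inner) for inner in lens)
    return (total_len, *tuple(trailing_shape))


seqlens = [[2], [3], [4]]  # three pieces of data, one sequence each
ids_shape = expected_buffer_shape(seqlens)              # token IDs have no trailing dims
logits_shape = expected_buffer_shape(seqlens, (1000,))  # one row of vocabulary logits per token
```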

property bs

The batch size or the number of data pieces in the sample.

cuda()

Move the data to the GPU in place.

classmethod disable_validation()

Disable the expensive pydantic validation within this context.

Used to accelerate gather/split/transfer operations since we have ensured that the data created in datasets and interfaces are valid.

classmethod from_default(seqlens: List[int], ids: List[Hashable], data: Dict[str, Tensor], metadata: Dict[str, Any] | None = None)

Construct a SequenceSample object from default parameters.

This helper function is intended for cases where each piece of data has a single sequence length (e.g., a single response for each prompt). The sequence lengths for different keys are resolved automatically according to the rules in _resolve_seqlen_from_key. While this function can reduce boilerplate code, it may introduce potential bugs, so it should be used with caution.

Parameters:
  • seqlens (List[int]) – The sequence lengths of each piece of data. This represents the length of the main attribute (e.g., packed_input_ids). Sequence lengths for other attributes (e.g., rewards and logprobs) are computed from this parameter. It is NOT the actual length of rewards or logprobs even if it is the only key in the data.

  • ids (List[Hashable]) – Unique identifiers for each piece of data.

  • data (Dict[str, torch.Tensor]) – The actual data.

  • metadata (Optional[Dict[str, Any]]) – Metadata for the sample. Should be a dictionary where each value is a list with one entry per piece of data, i.e., the same length as seqlens.

classmethod gather(samples: List[SequenceSample], keys: List[str] | None = None)

Gather a list of SequenceSample objects into a single batch.

Parameters:
  • samples (List[SequenceSample]) – A list of SequenceSample objects to be gathered.

  • keys (Optional[List[str]]) – The keys to be gathered. Only a subset of keys can be gathered. If None, the keys from the first sample will be used.
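Conceptually, gathering concatenates the IDs, the nested sequence lengths, and the packed data of each sample in order. The sketch below mimics this with plain dictionaries and lists standing in for SequenceSample objects and tensors; gather_plain is a hypothetical helper, not the library method.

```python
def gather_plain(samples, keys=None):
    """Concatenate dict-based stand-ins for samples into one batch."""
    # As documented above, default to the keys of the first sample.
    keys = list(keys) if keys is not None else list(samples[0]["seqlens"])
    out = {"ids": [], "seqlens": {k: [] for k in keys}, "data": {k: [] for k in keys}}
    for s in samples:
        out["ids"].extend(s["ids"])
        for k in keys:
            out["seqlens"][k].extend(s["seqlens"][k])  # outer list grows with the batch
            out["data"][k].extend(s["data"][k])        # packed data is simply appended
    return out


a = {"ids": ["a"], "seqlens": {"prompt": [[2]], "resp": [[1, 2]]},
     "data": {"prompt": [1, 2], "resp": [7, 8, 9]}}
b = {"ids": ["b"], "seqlens": {"prompt": [[1]], "resp": [[2]]},
     "data": {"prompt": [3], "resp": [5, 6]}}
batch = gather_plain([a, b])
prompts_only = gather_plain([a, b], keys=["prompt"])
```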

get_split_spec(k: int, key: str | None = None, min_size: int = 1) → SequenceSplitSpec

Get the partition specification for splitting the data into k parts using a dynamic programming algorithm to achieve the most balanced partitioning.

Parameters:
  • k (int) – The number of parts to split the data into.

  • key (Optional[str]) – The key to be used for splitting. If None, the key with the largest total sequence length will be used.

  • min_size (int) – The minimum size of each partition.

Returns:

A SequenceSplitSpec object representing the partitioning specification.

Return type:

SequenceSplitSpec
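One common way to balance a partition with dynamic programming is to split the list of lengths into k contiguous parts while minimizing the largest part sum. The sketch below implements that classic formulation; the objective, the tie-breaking, and the returned (start, end) index pairs are assumptions for illustration and may not match ReaL's algorithm or the fields of SequenceSplitSpec.

```python
from itertools import accumulate


def balanced_partition(lengths, k, min_size=1):
    """Split `lengths` into k contiguous parts, minimizing the largest part sum.

    Returns a list of (start, end) index pairs; every part holds at least
    `min_size` items. Illustrative only, not ReaL's implementation.
    """
    n = len(lengths)
    if k < 1 or n < k * min_size:
        raise ValueError("not enough items for the requested partition")
    prefix = [0] + list(accumulate(lengths))
    inf = float("inf")
    # cost[j][i]: smallest achievable maximum part sum using j parts for the first i items.
    cost = [[inf] * (n + 1) for _ in range(k + 1)]
    cut = [[0] * (n + 1) for _ in range(k + 1)]
    cost[0][0] = 0
    for j in range(1, k + 1):
        for i in range(j * min_size, n + 1):
            for s in range((j - 1) * min_size, i - min_size + 1):
                cand = max(cost[j - 1][s], prefix[i] - prefix[s])
                if cand < cost[j][i]:
                    cost[j][i], cut[j][i] = cand, s
    # Walk the recorded cut points backwards to recover the boundaries.
    bounds, i = [], n
    for j in range(k, 0, -1):
        s = cut[j][i]
        bounds.append((s, i))
        i = s
    return bounds[::-1]


# Response lengths from the SequenceSample example, split across 2 ranks.
parts = balanced_partition([6, 17, 15, 13, 15, 13], k=2)
```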

meta() → SequenceSample

Create a new SequenceSample that does not contain any data.

remap_keys_(remap: Dict[str, str])

Remap the keys of the data in place.

Useful for reusing the same interface implementation in different algorithms, where the data can be named differently.
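The idea can be sketched on a plain dictionary. The helper below is hypothetical and only illustrates the renaming; it rebuilds the mapping first so that swapping two names works as expected.

```python
def remap_keys_(store: dict, remap: dict) -> None:
    """Rename keys of `store` in place; keys absent from `remap` are kept."""
    renamed = {remap.get(key, key): value for key, value in store.items()}
    store.clear()
    store.update(renamed)


# An interface written against "packed_input_ids" can reuse data stored as "packed_seq".
store = {"packed_seq": [1, 2, 3], "rewards": [0.5]}
remap_keys_(store, {"packed_seq": "packed_input_ids"})
```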

split(k: int, key: str | None = None, min_size: int = 1) → List[SequenceSample]

Split the data into k parts.

This method uses the specified key or the key with the largest total sequence length to split the data into k parts. The partitioning ensures that each part meets the minimum size requirement.

Parameters:
  • k (int) – The number of parts to split the data into.

  • key (Optional[str]) – The key to use for splitting. If None, the key with the largest total sequence length will be used.

  • min_size (int) – The minimum size of each partition.

Returns:

A list of SequenceSample objects, each representing a part of the split data.

Return type:

List[SequenceSample]

split_with_spec(spec: SequenceSplitSpec) → List[SequenceSample]

Split the data according to the given spec.

unpack()

Unpack a batch of data into individual pieces of data.

update_(other: SequenceSample)

Update the data in place from another SequenceSample.

Used to amend newly produced data after a model function call.

Dataflow Graph

class realhf.MFCDef(name: str, n_seqs: int, interface_type: ModelInterfaceType, interface_impl: ModelInterfaceAbstraction, model_name: str | ModelName, input_keys: Tuple = <factory>, input_key_remap: Dict[str, str] = <factory>, output_keys: Tuple = <factory>, output_key_remap: Dict[str, str] = <factory>, n_mbs: int | None = None, balanced_dp: bool = False, log_return_value: bool = False, model_type: Any | ModelFamily | None = None, model_path: str | None = None, _G: networkx.DiGraph | None = None, _pre_hooks: List[OffloadHook | ParamReallocHook] = <factory>, _post_hooks: List[OffloadHook | ParamReallocHook] = <factory>)

A model function call (MFC) object used by the workers.

MFC stands for Model Function Call. This object serves as the interface for developing new algorithms and will be inserted into an nx.DiGraph as nodes. Edges will be automatically resolved based on input/output keys.

Fields starting with an underscore are filled automatically.

Note: In the ReaL implementation, the term RPC also refers to MFC.
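The automatic edge resolution can be pictured as matching each node's input keys against the output keys of the other nodes. The sketch below uses a small stand-in dataclass instead of MFCDef and plain tuples instead of a graph library; the node names, key names, and the rule that unmatched inputs come from the dataset are illustrative assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    """Hypothetical stand-in for an MFC: a name plus its input and output keys."""
    name: str
    input_keys: tuple = ()
    output_keys: tuple = ()


def resolve_edges(nodes):
    """Return sorted (producer, consumer, key) triples implied by the keys."""
    producers = {key: n.name for n in nodes for key in n.output_keys}
    edges = set()
    for n in nodes:
        for key in n.input_keys:
            src = producers.get(key)  # None: assumed to be loaded from the dataset
            if src is not None and src != n.name:
                edges.add((src, n.name, key))
    return sorted(edges)


nodes = [
    Node("actor_gen", ("packed_prompts",), ("packed_input_ids", "packed_logprobs")),
    Node("rew_inf", ("packed_input_ids",), ("rewards",)),
    Node("actor_train", ("packed_input_ids", "packed_logprobs", "rewards")),
]
edges = resolve_edges(nodes)
```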

Parameters:
  • name (str) – The unique identifier for this model function call.

  • n_seqs (int) – The number of sequences to be processed in a batch.

  • interface_type (ModelInterfaceType) – The type of interface used by the node (e.g., generate, train_step).

  • interface_impl (ModelInterfaceAbstraction) – The actual implementation of the interface when running this node.

  • model_name (str or ModelName) – The model identifier used by the node, corresponding to a unique LLM. The user-provided model name can be a string; the replica ID will be resolved in ReaL.

  • input_keys (Tuple) – Input data keys used to resolve dependencies.

  • output_keys (Tuple) – Output data keys used to resolve dependencies.

  • input_key_remap (Dict[str, str]) – Remap input keys to identifiers recognized by the interface implementation. Keys are from input_keys and values are identifiers known to the interface.

  • output_key_remap (Dict[str, str]) – Remap output keys to identifiers recognized by MFC. Keys are identifiers known to the interface, and values are from output_keys.

  • n_mbs (Optional[int]) – The number of micro-batches when executing this MFC. Defaults to 1 if pipeline parallelism is disabled, or to 2 * pp_size for train_step and pp_size for generate/inference if pipeline parallelism is enabled.

  • balanced_dp (bool) – Whether to balance data parallelism so that each DP rank receives exactly n_seqs // dp_size sequences. If False, ReaL will partition according to the number of tokens. This may lead to unbalanced sequence numbers if sequence lengths are not uniform, but ensures balanced memory usage.

  • log_return_value (bool) – Whether to log the return value of the interface implementation.

  • model_type (Optional[ModelFamily]) – The specification of the LLM, e.g., LLaMA-7B. Used by the profiler and search engine to produce an optimal execution plan. Can be omitted if the search engine is not used.

  • model_path (Optional[str]) – The path to the model file. Used to get the config for the search engine. Can be omitted if the search engine is not used.
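The default rule for n_mbs stated above can be written as a small function. This is a sketch: the string interface types and the convention that pp_size <= 1 means pipeline parallelism is disabled are assumptions made for illustration.

```python
def default_n_mbs(interface_type: str, pp_size: int) -> int:
    """Default number of micro-batches, following the rule documented for n_mbs."""
    if pp_size <= 1:  # assumption: a pipeline size of 1 means pipelining is disabled
        return 1
    if interface_type == "train_step":
        return 2 * pp_size
    return pp_size  # generate / inference


examples = {
    ("train_step", 1): default_n_mbs("train_step", 1),
    ("train_step", 4): default_n_mbs("train_step", 4),
    ("generate", 4): default_n_mbs("generate", 4),
}
```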

System-Level Configurations

Note

Users are not expected to modify these configurations. They are documented here to help readers understand the code architecture of ReaL.

class realhf.ModelShardID(model_name: ModelName, dp_rank: int, mp_rank: int, pp_rank: int, topo: PipeModelDataParallelTopology = <factory>)

The ID of a model shard in a specific model worker.

This ID is essentially a combination of the model name and the 3D parallelism rank, and can be used as a dictionary key. It represents the identity of a “model handler”. The master worker maintains a lookup table mapping the ModelShardID to the model worker index, which can be a many-to-one mapping. Requests are created with the ModelShardID; for example, actors with ranks (dp=*, mp=0, pp=0) should transfer data to the critics. The ModelShardID is then mapped to the model worker index, and the requests are sent to the corresponding model workers.

Parameters:
  • model_name (ModelName) – The name of the model.

  • dp_rank (int) – The data parallel rank.

  • mp_rank (int) – The tensor-model parallel rank.

  • pp_rank (int) – The pipeline-model parallel rank.

  • topo (PipeModelDataParallelTopology) – The 3D parallelism topology of this model.
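The lookup described above relies on the shard ID being hashable. The sketch below uses a frozen dataclass as a simplified stand-in (a role string instead of ModelName, and no topology); the table contents and the workers_for helper are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ShardID:
    """Simplified stand-in for ModelShardID; frozen so it can be a dict key."""
    role: str
    dp_rank: int
    mp_rank: int
    pp_rank: int


# Many-to-one: actor and critic shards can live on the same model worker.
shard_to_worker = {
    ShardID("actor", 0, 0, 0): 0,
    ShardID("actor", 1, 0, 0): 1,
    ShardID("critic", 0, 0, 0): 0,
    ShardID("critic", 1, 0, 0): 1,
}


def workers_for(role: str, mp_rank: int, pp_rank: int):
    """Worker indices hosting shards (dp=*, mp=mp_rank, pp=pp_rank) of a role."""
    return sorted({
        worker for sid, worker in shard_to_worker.items()
        if sid.role == role and sid.mp_rank == mp_rank and sid.pp_rank == pp_rank
    })


actor_senders = workers_for("actor", mp_rank=0, pp_rank=0)
```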

class realhf.ModelName(role: str, replica_id: int)

A unique identifier for a model.

Parameters:
  • role (str) – The role of the model, e.g., “actor” or “critic”.

  • replica_id (int) – The replica ID of the model. Different replicas of the same role have the same set of parameters but different memory locations. For example, if actor generation and training in PPO use different parallel strategies, they will have the same role but different replica IDs.

class realhf.ModelVersion(epoch: int = 0, epoch_step: int = 0, global_step: int = 0)

A version counter.

Parameters:
  • epoch (int) – The current epoch.

  • epoch_step (int) – The current step within the current epoch. A “step” refers to a traversal of the dataflow graph (DFG), which may include multiple model update steps depending on the interface (e.g., PPO mini-batched updates).

  • global_step (int) – The total number of steps since the start of the experiment.
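How the three counters relate can be sketched as follows. The rollover rule (reset epoch_step and increment epoch after a fixed number of steps per epoch) is an assumption made for illustration; the class and function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Version:
    """Stand-in for ModelVersion with the same three counters."""
    epoch: int = 0
    epoch_step: int = 0
    global_step: int = 0


def advance(v: Version, steps_per_epoch: int) -> None:
    """Record one traversal of the dataflow graph."""
    v.epoch_step += 1
    v.global_step += 1
    if v.epoch_step == steps_per_epoch:  # assumed rollover at the epoch boundary
        v.epoch += 1
        v.epoch_step = 0


version = Version()
for _ in range(5):
    advance(version, steps_per_epoch=2)
```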

class realhf.Model(name: ModelName, module: PipelinableEngine | torch.nn.Module, tokenizer: transformers.PreTrainedTokenizerFast, device: str | torch.device, dtype: torch.dtype | None = None, version: ModelVersion = <factory>, ft_spec: FinetuneSpec | None = None)

A collection consisting of a neural network, a tokenizer, and metadata with a unique name.

Parameters:
  • name (ModelName) – The unique name of the model.

  • module (PipelinableEngine | torch.nn.Module) – The neural network module. Its parameters may be sharded by tensor or pipeline parallelism.

  • tokenizer (transformers.PreTrainedTokenizerFast) – The tokenizer associated with the model.

  • device (Union[str, torch.device]) – The device on which to run the model.

  • dtype (Optional[torch.dtype]) – The data type of the model. Defaults to torch.float16 if None.

  • version (ModelVersion) – The version of the model.

  • ft_spec (FinetuneSpec) – The fine-tuning specification for the model. Generally not used.

class realhf.ModelBackend

A backend that wraps Model to provide additional functionalities such as pipelined model function calls and ZeRO optimization.

Current backend implementations include inference, DeepSpeed, and Megatron. The inference backend provides only inference and generation APIs, while the DeepSpeed and Megatron backends also support training.

The backend offers two main functionalities:

  1. Pipelined generation, inference, and training, implemented in ReaL.

  2. ZeRO optimization, implemented in DeepSpeed and Megatron.

After initialization, the module attribute in Model will have the same signature as PipelinableEngine. See realhf/impl/model/backend for concrete implementations.

destroy(model: Model)

Destroy the backend and release GPU memory.

initialize(model: Model, spec: FinetuneSpec) → Model

Initialize the model with the backend to support pipelining and distributed optimization.

class realhf.PipelinableEngine

Defines the signature for modules after backend initialization.

Modules with this signature will be passed to ModelInterface for model function call execution.

See inference.py, deepspeed.py, and megatron.py for concrete implementations.

eval_batch(input_: SequenceSample, loss_fn: Callable[[Tensor, SequenceSample], Tuple[Tensor, Dict]], num_micro_batches: int | None = None) → Tuple[Tensor, Dict] | None

Evaluate the model using the forward pass and loss function.

This method wraps forward() with a customized post_hook and aggregate_fn.

Parameters:
  • input_ (SequenceSample) – The input data. It should contain at least the key packed_input_ids, which includes the concatenated token sequences. It should also include any other entries required to compute the loss.

  • loss_fn (Callable[[torch.Tensor, SequenceSample], Tuple[torch.Tensor, Dict]]) – The loss function. It takes the output of the forward pass and the input data, returning the loss and a dictionary of statistics.

  • num_micro_batches (Optional[int]) – The number of micro-batches to split the batch into. This argument is retained for compatibility, although it should not be used, since different batch sizes can be directly set in the dataloader, and batch size during evaluation does not impact algorithmic performance like it does during training.

Returns:

The aggregated scalar loss and a dictionary of statistics from the last pipeline stage. Returns None otherwise.

Return type:

Tuple[torch.Tensor, Dict] | None

forward(input_: SequenceSample, num_micro_batches: int | None = None, post_hook: Callable[[Tensor, SequenceSample], Any] | None = None, aggregate_fn: Callable[[List[Any]], Any] = torch.cat) → Any | None

Run the forward pass or inference on the model. Note that it is gradient-free.

To train the model, use train_batch() instead.

Parameters:
  • input_ (SequenceSample) – The input data. It should contain at least the key packed_input_ids, which includes the concatenated token sequences.

  • num_micro_batches (Optional[int]) – The number of micro-batches to split the batch into. Regardless of pipelining, micro-batches will be fed into the module one by one. This approach helps reduce GPU memory usage of hidden states. If None, the batch will not be split.

  • post_hook (Callable[[torch.Tensor, SequenceSample], Any] | None) – A function to apply to the output after the forward pass. It takes the output tensor and the input data, returning an arbitrary result. With a post_hook, the output can be processed one micro-batch at a time, reducing memory usage for operations such as gathering log-probabilities. If None, the output tensor is returned unchanged.

  • aggregate_fn (Callable[[List[Any]], Any]) – A function to aggregate the results of the post_hook.

Returns:

The aggregated result of the post_hook from the last pipeline stage. Returns None otherwise. The output before post_hook is a concatenated tensor along the batch-sequence dimension, similar to packed_input_ids. For example, if we have 3 sequences with lengths [2, 3, 4], and the vocabulary size is 1000, packed_input_ids should have shape [9], and the logits should have shape [9, 1000].

Return type:

Any | None
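The interplay of num_micro_batches, post_hook, and aggregate_fn can be sketched without any tensor library. Below, lists stand in for tensors, a toy model_fn stands in for the engine, and micro-batches are formed by sequence count; all names are hypothetical, and the real engine may partition the batch differently.

```python
from typing import Any, Callable, List, Optional


def concat(parts: List[Any]) -> Any:
    """List-based stand-in for concatenating per-micro-batch results."""
    return [x for part in parts for x in part]


def forward_in_micro_batches(
    sequences: List[List[int]],
    model_fn: Callable[[List[int]], List[float]],
    num_micro_batches: Optional[int] = None,
    post_hook: Optional[Callable[[List[float], List[List[int]]], Any]] = None,
    aggregate_fn: Callable[[List[Any]], Any] = concat,
) -> Any:
    n_mbs = num_micro_batches or 1
    size = max(1, -(-len(sequences) // n_mbs))  # ceil division
    results = []
    for start in range(0, len(sequences), size):
        mb = sequences[start:start + size]
        packed = [tok for seq in mb for tok in seq]  # packed along the sequence dimension
        out = model_fn(packed)                       # one output row per token
        results.append(post_hook(out, mb) if post_hook else out)
    return aggregate_fn(results)


def double(packed: List[int]) -> List[float]:
    """Toy model: one output value per input token."""
    return [2.0 * t for t in packed]


seqs = [[1, 2], [3, 4, 5], [6, 7, 8, 9]]  # lengths [2, 3, 4], 9 tokens in total
full = forward_in_micro_batches(seqs, double)
sums = forward_in_micro_batches(
    seqs, double, num_micro_batches=2, post_hook=lambda out, mb: [sum(out)]
)
```

With the hook, only one number per micro-batch is kept, which is the memory saving the post_hook parameter is meant to enable.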

generate(input_: SequenceSample, tokenizer: transformers.PreTrainedTokenizerFast, gconfig: GenerationHyperparameters = <factory>, num_micro_batches: int | None = None) → Tuple[Tensor, Tensor, Tensor | None] | None

Generate outputs from the model.

Parameters:
  • input_ (SequenceSample) – The input data. It should contain at least the key packed_input_ids, which includes the concatenated prompts.

  • tokenizer (transformers.PreTrainedTokenizerFast) – The tokenizer for the model.

  • gconfig (GenerationHyperparameters) – The generation hyperparameters.

  • num_micro_batches (Optional[int]) – The number of micro-batches to split the batch into. Regardless of pipelining, micro-batches will be processed one by one by the module. This approach helps reduce GPU memory usage for hidden states and KV-caches. If None, the batch will not be split.

Returns:

For the last pipeline stage, returns the generated tokens, log probabilities, and optionally the logits mask. See GenerationHyperparameters for more details about the logits mask. Returns None for other stages. The outputs are stacked tensors along the batch dimension. For example, if we have 3 prompts with lengths [2, 3, 4], a maximum generated length of 5, and a vocabulary size of 1000, packed_input_ids should have shape [9], generated tokens and log probabilities should have shape [3, 5], and the logits mask should have shape [3, 5, 1000].

Return type:

Tuple[torch.Tensor, torch.Tensor, torch.Tensor | None] | None

train_batch(input_: SequenceSample, loss_fn: Callable[[Tensor, SequenceSample], Tuple[Tensor, Dict]], version_steps: int, num_micro_batches: int | None = None) → Tuple[Tensor, Dict] | None

Update the model with a batch of data and a loss function.

Parameters:
  • input_ (SequenceSample) – The input data. It should contain at least the key packed_input_ids, which includes the concatenated token sequences. It should also include any other entries required to compute the loss.

  • loss_fn (Callable[[torch.Tensor, SequenceSample], Tuple[torch.Tensor, Dict]]) – The loss function. It takes the output of the forward pass and the input data, returning the loss and a dictionary of statistics.

  • version_steps (int) – The global step counter for this experiment, used by the backend to determine the learning rate schedule.

  • num_micro_batches (Optional[int]) – The number of micro-batches to split the batch into. Gradients will be accumulated across micro-batches, and only one update will occur. For pipelined training, micro-batches are processed together by the engine, which automatically schedules the forward and backward passes. For non-pipelined training, forward and backward passes are executed iteratively over micro-batches to accumulate gradients. If None, the batch will not be split.

Returns:

The aggregated scalar loss and a dictionary of statistics from the last pipeline stage. Returns None otherwise.

Return type:

Tuple[torch.Tensor, Dict] | None
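The non-pipelined case, where gradients are accumulated over micro-batches before a single update, can be illustrated with a one-parameter model in plain Python. This is a toy sketch with hypothetical names; it is not the backend's training loop and ignores the learning rate schedule driven by version_steps.

```python
def grad_sum_sq_err(w, xs, ys):
    """Gradient of sum((w * x - y) ** 2) with respect to w."""
    return sum(2 * (w * x - y) * x for x, y in zip(xs, ys))


def train_batch_toy(w, xs, ys, lr, num_micro_batches=None):
    """Accumulate gradients over micro-batches, then apply exactly one update."""
    n_mbs = num_micro_batches or 1
    size = max(1, -(-len(xs) // n_mbs))  # ceil division
    grad = 0.0
    for s in range(0, len(xs), size):
        grad += grad_sum_sq_err(w, xs[s:s + size], ys[s:s + size])  # no update yet
    return w - lr * grad / len(xs)  # single step on the mean gradient


xs, ys = [1.0, 2.0, 3.0, 4.0], [2.0, 4.0, 6.0, 8.0]
w_full = train_batch_toy(0.0, xs, ys, lr=0.01)
w_split = train_batch_toy(0.0, xs, ys, lr=0.01, num_micro_batches=2)
```

Because the accumulated gradient is normalized by the full batch size, splitting into micro-batches changes memory usage but not the resulting update.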

class realhf.ModelInterface

An interface for model training, evaluation, inference, and generation.

This interface is designed to follow the dependency injection pattern. We pass the model to the interface and call its methods, ensuring that model APIs and algorithms are fully decoupled. For example, REINFORCE and PPO can exhibit different behaviors during training. Separate interfaces can be written for these algorithms while using the same model that provides basic forward-backward-update functionality (i.e., PipelinableEngine).

During runtime, the master worker requests model workers to execute a specific interface type (e.g., generate) on a specific model. The model worker locates the corresponding model, passes it into the requested interface, performs the computation, and returns the result.

Users can easily create new interfaces to support customized usage. See Customization for more details.

evaluate(model: Model, eval_dataloader: DataLoader) → Dict
generate(model: Model, data: SequenceSample, n_mbs: int | None = None) → SequenceSample
inference(model: Model, data: SequenceSample, n_mbs: int | None = None) → SequenceSample
mock(type_: str, model: Model, data: SequenceSample) → SequenceSample
save(model: Model, save_dir: str)
train_step(model: Model, data: SequenceSample, n_mbs: int | None = None) → Dict
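The dependency injection pattern and the runtime dispatch described for ModelInterface can be sketched with toy classes. TinyModel, RegressionInterface, and handle_request are hypothetical stand-ins; a real interface would subclass ModelInterface and operate on SequenceSample objects.

```python
class TinyModel:
    """Stand-in for a model engine: y = w * x."""

    def __init__(self, w: float = 0.0):
        self.w = w

    def forward(self, xs):
        return [self.w * x for x in xs]


class RegressionInterface:
    """One algorithm written against the model API; the model is passed in."""

    def inference(self, model, data):
        return {"pred": model.forward(data["x"])}

    def train_step(self, model, data, lr=0.01):
        preds = model.forward(data["x"])
        grad = sum(
            2 * (p - y) * x for p, x, y in zip(preds, data["x"], data["y"])
        ) / len(preds)
        model.w -= lr * grad
        return {"grad": grad}


def handle_request(models, interface, model_name, interface_type, data):
    """Mimic a worker: locate the model, then call the requested interface method."""
    model = models[model_name]
    return getattr(interface, interface_type)(model, data)


models = {"actor": TinyModel(w=1.0)}
iface = RegressionInterface()
data = {"x": [1.0, 2.0], "y": [2.0, 4.0]}
stats = handle_request(models, iface, "actor", "train_step", data)
out = handle_request(models, iface, "actor", "inference", data)
```

Swapping in a different interface class changes the algorithm without touching TinyModel, which is the decoupling the pattern aims for.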