Setting Up Distributed Experiments

Currently, ReaL supports launching distributed experiments using SLURM with the Pyxis plugin, or using Ray. We recommend Ray for its simplicity and flexibility.

Ray is included in the requirements.txt file, so it should already be installed if you followed the instructions in the Installation section. Pyxis and SLURM, however, need to be installed and set up separately by following their respective documentation.

Cluster Spec

Before running experiments, ReaL needs to know some information about the cluster. You need to create a JSON cluster configuration file, as shown in the example in examples/cluster_config.json. Its fields are described below, followed by an illustrative sketch.

Note

This file is necessary for SLURM, but optional for Ray.

  • cluster_type: The type of the cluster, either “slurm” or “ray”.

  • cluster_name: The name of the cluster. This can be arbitrary.

  • fileroot: An NFS path accessible by all nodes, where logs and checkpoints will be stored. By default, ReaL assumes that the /home/ directory is shared by all nodes in the cluster. ReaL will not run correctly if this path is not accessible by all nodes.

  • default_mount: A comma-separated list of paths to mount on all nodes, including the fileroot mentioned above. This is only used by SLURM.

  • node_type_from_node_name: A dictionary mapping a regular expression to a node type. Every host in this cluster should match one of these regular expressions. Node types include [“g1”, “g2”, “g8”, “a100”], where “g” refers to low-end GPUs in the cluster. This is only used by SLURM.

  • gpu_type_from_node_name: A dictionary mapping a regular expression to a GPU type. This is used by SLURM.

  • cpu_image: The Docker image for the controller and the master worker. This is only used by SLURM.

  • gpu_image: The Docker image for the model worker. This is only used by SLURM.

  • node_name_prefix: The prefix of the host names. We assume that every host name in the cluster consists of a common string prefix followed by an integer, e.g., “com-01”, where “com-” is the prefix. If not provided, the default prefix is “NODE”, i.e., NODE01, NODE02, etc.
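Putting these fields together, a cluster specification might look like the sketch below. This is only illustrative: the host-name regular expressions, node types, image names, and paths are hypothetical and must be adapted to your own cluster (see examples/cluster_config.json for the example shipped with the repository).

$ # Illustrative only: adapt paths, regexes, and image names to your cluster
$ cat > /tmp/my-cluster.json <<'EOF'
{
    "cluster_type": "slurm",
    "cluster_name": "my_cluster",
    "fileroot": "/mnt/nfs/realhf",
    "default_mount": "/mnt/nfs/realhf,/data",
    "node_type_from_node_name": {"com-0[0-7]": "g8", "com-08": "a100"},
    "gpu_type_from_node_name": {"com-0[0-7]": "geforce", "com-08": "a100"},
    "cpu_image": "registry.example.com/realhf-cpu:latest",
    "gpu_image": "registry.example.com/realhf-gpu:latest",
    "node_name_prefix": "com-"
}
EOF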

After creating this file, specify it using the CLUSTER_SPEC_PATH environment variable when launching experiments. For example:

CLUSTER_SPEC_PATH=/tmp/my-cluster.json python3 -m realhf.apps.quickstart ppo ...

Note

If the CLUSTER_SPEC_PATH variable is not set, the Ray mode will use a default fileroot /home/$USER/.cache/realhf and a default node prefix “NODE”.

This means that logs and checkpoints will be saved to /home/$USER/.cache/realhf, and that manual allocations should refer to nodes using names like NODE[01-02].

It is the user’s responsibility to ensure that the fileroot /home/$USER/.cache/realhf is accessible by all nodes in the cluster.
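If you want to override these defaults, you can still point CLUSTER_SPEC_PATH at a cluster spec when using Ray. A minimal sketch might look like the following; the fields documented above as SLURM-only are omitted, cluster_type and cluster_name are kept for completeness, and the path and prefix are placeholders:

$ # Illustrative minimal spec for Ray mode
$ cat > /tmp/ray-cluster.json <<'EOF'
{
    "cluster_type": "ray",
    "cluster_name": "my_ray_cluster",
    "fileroot": "/mnt/nfs/realhf",
    "node_name_prefix": "com-"
}
EOF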

Distributed Experiments with Ray

We assume you have installed Ray via PyPI on every node in your cluster; make sure the Ray version is the same on all nodes. Next, we recommend setting up the Ray cluster with the Ray CLI by running the following commands:

$ # On the head node
$ ray start --head --port=6379
Local node IP: xxx.xxx.xxx.xxx

--------------------
Ray runtime started.
--------------------

Next steps
To add another node to this Ray cluster, run
    ray start --address='xxx.xxx.xxx.xxx:6379'

To connect to this Ray cluster:
    import ray
    ray.init()

To terminate the Ray runtime, run
    ray stop

To view the status of the cluster, use
    ray status
$ # On the worker nodes
$ ray start --address='xxx.xxx.xxx.xxx:6379'

After setting up the Ray cluster, you can run experiments on the head node by replacing mode=local with mode=ray in the scripts. You can then change n_nodes, the device allocation, and the parallel strategies to scale experiments beyond a single node, i.e., to more than n_gpus_per_node GPUs. No additional changes are required!
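For instance, a multi-node Ray launch might look like the sketch below. The quickstart invocation and CLUSTER_SPEC_PATH usage follow the earlier example; the experiment type, the n_nodes value, and the remaining options are placeholders to be replaced by your own configuration:

$ # On the Ray head node; values are illustrative
$ CLUSTER_SPEC_PATH=/tmp/my-cluster.json \
    python3 -m realhf.apps.quickstart ppo \
    mode=ray n_nodes=2 ...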

A few additional notes on the Ray cluster setup follow.

Ray Resources

If your cluster is not homogeneous, for example, if the head node is a CPU machine without a GPU, you can specify the resources using the Ray CLI:

$ # On the head node
$ ray start --head --port=6379 --num-cpus=1 --num-gpus=0 --memory=10000000000

This command reserves 1 CPU core, 0 GPUs, and roughly 10 GB of memory (the --memory flag takes a value in bytes) for the head node. As a result, model workers and the master worker will not be scheduled on the head node.

ReaL will detect all available resources by calling ray.init() on the head node. The driver process that calls ray.init() does not consume any resources. Only the workers (i.e., model workers and the master worker) will consume resources according to the scheduling setup in the experiment configuration.

If there are not enough resources available, Ray jobs will wait until the requested resources become available, and ReaL will print a message in the terminal. You can elastically add new nodes to the cluster with ray start to provide the missing resources.

Graceful Shutdown

Nodes in the Ray cluster can be shut down with the ray stop command. Currently, ReaL has a known issue where, when an experiment terminates, only the driver process is killed, leaving stale worker processes on the remote nodes.

Note

Users should manually kill the worker processes on the remote nodes using ray stop; otherwise, a new experiment on the same Ray cluster will get stuck.
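For example, assuming passwordless SSH access to the worker nodes (the host names below are placeholders), the stale processes can be cleaned up like this after the experiment finishes:

$ # Substitute your own worker host names
$ for host in com-01 com-02; do ssh "$host" ray stop; done
$ # Then, if desired, stop the head node as well
$ ray stop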

Distributed Experiments with SLURM + Pyxis

After specifying the cluster configuration file, you can run experiments with mode=slurm in the scripts. ReaL’s scheduler will submit jobs to the SLURM resource manager automatically.
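As a sketch, and analogous to the Ray example above, a SLURM launch might look like the following; the experiment type and the remaining options are placeholders:

$ CLUSTER_SPEC_PATH=/tmp/my-cluster.json \
    python3 -m realhf.apps.quickstart ppo \
    mode=slurm ...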