Decentralized Training on Ravnest

Ravnest integrates with PyTorch and provides an interface for setting up and managing distributed training environments. It lets users scale their training workloads across multiple devices to reduce training times, without extensive configuration or specialized knowledge of distributed systems. This makes large-scale training more accessible to developers and researchers.

Getting the Main Model Ready

First, define your PyTorch model in the usual way. Ensure it is encapsulated within a class that inherits from torch.nn.Module and includes a defined forward() method.

import torch
import torch.nn as nn

class DL_Model(nn.Module):
    def __init__(self):
        super(DL_Model, self).__init__()
        # Model layers go here
        ...

    def forward(self, x):
        # Forward pass logic goes here
        ...

model = DL_Model()

Or, if you’re feeling adventurous, load popular models straight from PyTorch-based libraries:

from torchvision.models import resnet50
model = resnet50()

For a TorchScript model, load the .pt file using torch.jit.load():

import torch
model = torch.jit.load("model_scripted.pt")

As long as you have a PyTorch model instance, you’re all set.
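
Before moving on, it can help to confirm that the model really is an nn.Module and that a forward pass runs on an input of the expected shape. The sketch below uses a hypothetical TinyConvNet as a stand-in for DL_Model; its layers and sizes are illustrative assumptions, not part of Ravnest.

```python
import torch
import torch.nn as nn


class TinyConvNet(nn.Module):
    """Hypothetical stand-in for DL_Model; layer sizes are illustrative only."""

    def __init__(self, num_classes: int = 10):
        super().__init__()
        self.conv = nn.Conv2d(3, 8, kernel_size=3, padding=1)
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.fc = nn.Linear(8, num_classes)

    def forward(self, x):
        x = torch.relu(self.conv(x))   # (N, 8, H, W)
        x = self.pool(x).flatten(1)    # (N, 8)
        return self.fc(x)              # (N, num_classes)


model = TinyConvNet()

# Any model handed to the later steps should pass this check.
assert isinstance(model, nn.Module)

# Run one forward pass on a dummy batch to catch shape errors early.
with torch.no_grad():
    out = model(torch.rand(4, 3, 28, 28))
print(out.shape)
```

The same two checks apply unchanged to a torchvision model or to a module loaded with torch.jit.load().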

Defining the Provider Nodes

To set up the compute provider nodes, write a JSON file containing the system specification metadata of each node. In your project directory, create this JSON file at node_data/node_configs.json.

This file should include information such as the node’s publicly accessible IP address, available compute memory (can be RAM or VRAM), and network bandwidth details. Below is an example structure of the JSON file:

{
    "0":{
        "IP":"<host>:<port>",
        "benchmarks":{
            "ram":8,
            "bandwidth":20
        }
    },
    "1":{
        "IP":"<host>:<port>",
        "benchmarks":{
            "ram":16,
            "bandwidth":10
        }
    },
    "2":{
        "IP":"<host>:<port>",
        "benchmarks":{
            "ram":8,
            "bandwidth":10
        }
    }
}

Add one entry per node, for as many nodes as you need. The ram value for each node, which can signify either CPU RAM or GPU VRAM, is given in GB, while the bandwidth is in Mbps.
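
If you would rather generate this file than write it by hand, a small script can do it. The following is a minimal sketch using only the standard library: build_node_configs() and write_node_configs() are hypothetical helpers (not part of Ravnest), the addresses are placeholders from a documentation-only IP range, and numbering the keys consecutively from "0" is an assumption based on the example above.

```python
import json
from pathlib import Path


def build_node_configs(nodes):
    """Map (address, ram_gb, bandwidth_mbps) tuples to the JSON layout shown above."""
    configs = {}
    for node_id, (address, ram_gb, bandwidth_mbps) in enumerate(nodes):
        host, sep, port = address.rpartition(":")
        if not (host and sep and port.isdigit()):
            raise ValueError(f"expected '<host>:<port>', got {address!r}")
        configs[str(node_id)] = {
            "IP": address,
            "benchmarks": {"ram": ram_gb, "bandwidth": bandwidth_mbps},
        }
    return configs


def write_node_configs(configs, path="node_data/node_configs.json"):
    """Write the configs to disk, creating the node_data folder if needed."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(configs, indent=4))


# Placeholder addresses; replace with each provider's public host and port.
nodes = [
    ("203.0.113.10:8080", 8, 20),   # ram in GB, bandwidth in Mbps
    ("203.0.113.11:8080", 16, 10),
    ("203.0.113.12:8080", 8, 10),
]

configs = build_node_configs(nodes)
print(json.dumps(configs, indent=4))
# write_node_configs(configs)  # uncomment to create node_data/node_configs.json
```

Generating the file through json.dumps() also guarantees that it is valid JSON, which rules out stray comments or trailing commas.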

Note

In future releases, we intend to implement mechanisms to dynamically create and update this JSON file as and when new nodes join the training session. For now, we can work by manually defining the provider nodes in the above format.

Model Fragmentation and Cluster Formation

The next step is to form clusters of provider nodes, fragment the main PyTorch model into sub-models, and assign each sub-model to an individual provider node. Ravnest performs model fragmentation and cluster formation together, aiming for an optimal distribution of model parameters and computational load across the available provider nodes.

To achieve this, Ravnest needs a good estimate of the model’s peak memory usage, which is crucial for forming clusters well. We therefore pass a dummy input along with the main PyTorch model to Ravnest’s clusterize() method.

import torch
from ravnest import clusterize, set_seed

set_seed(42)

model = DL_Model()    # The main PyTorch model which was previously defined/loaded.
example_args = torch.rand((64,3,28,28))    # Sample input that the main PyTorch model expects.

clusterize(model=model, example_args=(example_args,))

For reproducibility, we encourage you to use the set_seed() method. Running the above code creates a few subfolders containing metadata inside the node_data folder. If you explore the metadata, you will find the resulting sub-models.
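
To explore what clusterize() produced, you can simply list everything under node_data. The sketch below uses only the standard library and makes no assumption about the names of the generated subfolders or files; list_tree() is a hypothetical helper, not a Ravnest function.

```python
from pathlib import Path


def list_tree(root="node_data"):
    """Return every file and folder below root as sorted relative paths."""
    root = Path(root)
    if not root.is_dir():
        return []  # nothing to show if clusterize() has not been run yet
    return sorted(p.relative_to(root).as_posix() for p in root.rglob("*"))


for entry in list_tree():
    print(entry)
```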

Inferring Provider Roles

The cluster assigned to each individual provider node is visible in the logs of the clusterize() method. For instance:

Node(0, Cluster(1))
self.IP(0.0.0.0:8080)
Ring IDs({0: 'L__self___conv2d_1.weight'})
Address2Param({'0.0.0.0:8081': 'L__self___conv2d_1.weight'})


Node(1, Cluster(0))
self.IP(0.0.0.0:8081)
Ring IDs({0: 'L__self___conv2d_1.weight'})
Address2Param({'0.0.0.0:8080': 'L__self___conv2d_1.weight'})


Node(2, Cluster(0))
self.IP(0.0.0.0:8082)
Ring IDs({1: 'L__self___dense_1.weight'})
Address2Param({'0.0.0.0:8083': 'L__self___dense_1.weight'})


Node(3, Cluster(1))
self.IP(0.0.0.0:8083)
Ring IDs({1: 'L__self___dense_1.weight'})
Address2Param({'0.0.0.0:8082': 'L__self___dense_1.weight'})


Node(4, Cluster(1))
self.IP(0.0.0.0:8084)
Ring IDs({2: 'L__self___bn_3.weight'})
Address2Param({'0.0.0.0:8085': 'L__self___bn_3.weight'})


Node(5, Cluster(0))
self.IP(0.0.0.0:8085)
Ring IDs({2: 'L__self___bn_3.weight'})
Address2Param({'0.0.0.0:8084': 'L__self___bn_3.weight'})

From the above log, by looking at the order of node assignment for each cluster, the following can be inferred:

Cluster 0 : Node(1) -> Node(2) -> Node(5)
Cluster 1 : Node(0) -> Node(3) -> Node(4)

This makes it easy to identify the roles of each Provider node:

Node(0) -> Root
Node(1) -> Root
Node(2) -> Stem
Node(3) -> Stem
Node(4) -> Leaf
Node(5) -> Leaf
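
With many nodes, reading the roles off the log by hand becomes tedious. The inference above can be sketched in a few lines of Python. This assumes the log keeps the Node(<id>, Cluster(<id>)) format shown earlier and that the order in which nodes appear within a cluster is their order in the pipeline; infer_roles() is a hypothetical helper, not part of Ravnest.

```python
import re
from collections import defaultdict

# Matches header lines such as "Node(3, Cluster(1))".
NODE_LINE = re.compile(r"Node\((\d+),\s*Cluster\((\d+)\)\)")


def infer_roles(log_text):
    """Group nodes by cluster in log order, then label first/middle/last."""
    clusters = defaultdict(list)
    for node_id, cluster_id in NODE_LINE.findall(log_text):
        clusters[int(cluster_id)].append(int(node_id))

    roles = {}
    for members in clusters.values():
        for position, node_id in enumerate(members):
            if position == 0:
                roles[node_id] = "Root"   # first node of the cluster
            elif position == len(members) - 1:
                roles[node_id] = "Leaf"   # last node of the cluster
            else:
                roles[node_id] = "Stem"   # everything in between
    return dict(clusters), roles


# Header lines taken from the example log above.
log_text = """
Node(0, Cluster(1))
Node(1, Cluster(0))
Node(2, Cluster(0))
Node(3, Cluster(1))
Node(4, Cluster(1))
Node(5, Cluster(0))
"""

clusters, roles = infer_roles(log_text)
for cluster_id in sorted(clusters):
    chain = " -> ".join(f"Node({n})" for n in clusters[cluster_id])
    print(f"Cluster {cluster_id} : {chain}")
for node_id in sorted(roles):
    print(f"Node({node_id}) -> {roles[node_id]}")
```

Run on the example log, this reproduces the cluster chains and role assignments listed above.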

Preparing the Provider Script

Now that the main model has been divided into sub-models and the provider nodes have been organized into clusters, we can prepare the unified code that each Provider executes according to its position within its designated cluster. The responsibilities and characteristics of the different roles that Providers can take up within a cluster are covered in detail elsewhere in the Ravnest documentation.

Here’s the template for creating a unified Provider script:

import torch
from torch.utils.data import DataLoader
from ravnest import Node, Trainer, set_seed

set_seed(42)

def preprocess_dataset():
    """
    Method to pre-process the dataset.
    Here you must prepare PyTorch DataLoader Objects for Training and Validation.
    It is recommended to use the torch.Generator() object with manual_seed() set in the DataLoader
    to ensure data is loaded in the correct order across providers in the cluster.
    """
    ...
    return train_loader, val_loader

def loss_fn(predictions, targets):
    """
    Method that defines how loss criterion needs to be evaluated.
    Argument targets is the next batch from the DataLoader passed as `labels` in your Node() instance.
    Extract the appropriate target labels from it here and set device, adjust dtype etc accordingly.
    If your update frequency > 1, you may want to divide your final loss by the update frequency value.
    """
    ...
    return loss_value

if __name__ == '__main__':

    train_loader, val_loader = preprocess_dataset()

    node = Node(name='node_<id>',   # Strictly in format: 'node_0', 'node_3', 'node_10' etc.
                criterion=loss_fn,  # Only the name of your defined loss_fn method without calling.
                ...)   # Pass other parameters to define your Node, including optimizer, labels, test_labels etc.

    trainer = Trainer(node=node, ...)     # Pass appropriate parameters like epochs, train_loader, val_loader etc. This can also be a Custom Trainer class instance that extends Ravnest's Trainer.

    trainer.train()     # Commences Training
    trainer.evaluate()  # To check accuracy of model post-training.
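
As an illustration of the loss_fn placeholder in the template, here is one possible implementation for a classification task. It is a sketch under stated assumptions: the DataLoader passed as labels is assumed to yield (inputs, labels) pairs, cross-entropy is just an example criterion, and UPDATE_FREQUENCY is a hypothetical constant standing in for whatever update frequency you configure.

```python
import torch
import torch.nn.functional as F

UPDATE_FREQUENCY = 4  # hypothetical: batches accumulated per optimizer step


def loss_fn(predictions, targets):
    # Assumption: each batch from the labels DataLoader is an (inputs, labels) pair.
    _, labels = targets
    # Match the device of the predictions and use the dtype cross_entropy expects.
    labels = labels.to(predictions.device, dtype=torch.long)
    # Scale the loss so gradients accumulated over several batches average out.
    return F.cross_entropy(predictions, labels) / UPDATE_FREQUENCY


# Quick check with uniform logits over 3 classes.
predictions = torch.zeros(2, 3)
batch = (torch.rand(2, 3, 28, 28), torch.tensor([0, 2]))
loss = loss_fn(predictions, batch)
print(loss.item())
```

With all-zero logits, the unscaled cross-entropy is ln(3), so the printed value should be ln(3) / 4.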

Note

Please ensure that the correct name is passed to the Node() instance as a string in the format 'node_0', 'node_7' and so on. Ravnest automatically determines the Provider’s role based on this name parameter, so accuracy is essential. Also, make sure you pass only the name of your loss_fn as criterion to the Node() instance, without calling it.
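
Since a mistyped name silently changes which role a Provider takes, you may want to build and check the name in code rather than typing it by hand. The helpers below are hypothetical and use only the standard library; the pattern assumes the format is exactly node_ followed by decimal digits, as in the examples above.

```python
import re

# 'node_' followed by one or more digits, and nothing else.
NODE_NAME = re.compile(r"node_\d+")


def node_name(node_id: int) -> str:
    """Build the name string for a given node ID."""
    return f"node_{node_id}"


def is_valid_node_name(name) -> bool:
    """Return True only for strings such as 'node_0' or 'node_10'."""
    return isinstance(name, str) and NODE_NAME.fullmatch(name) is not None


for candidate in [node_name(7), "Node_7", "node-7", "node_"]:
    print(candidate, is_valid_node_name(candidate))
```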

In decentralized training, it is crucial that the data order is synchronized across all nodes to maintain the integrity of the training process. Since the training loss is ultimately evaluated at the Leaf node (another type of node present at the end of the cluster), the data instances processed by the Root Provider must match those processed by the Leaf Provider.

For the training to be accurate, the order of data instances in the DataLoader used by the Root Provider must be identical to the order in the DataLoader used by the Leaf Provider. This synchronization ensures that each data instance is paired with the correct true label during training, which is essential for the model to learn correctly. To ensure this, we utilize Ravnest’s set_seed() method and pass the same seed value across all Provider scripts in a cluster.

In case you intend to shuffle data inside the DataLoader, we strongly encourage you to also define a torch.Generator() object, seed it with manual_seed(), and pass it to your DataLoader instance. Doing so keeps the order of the data instances consistent across Providers when shuffle=True.
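
A minimal sketch of this pattern, using a toy TensorDataset in place of your real data, is shown below. make_loader() is a hypothetical helper, and the two loaders stand in for the DataLoaders built independently by the Root and Leaf Provider scripts.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

SEED = 42  # must be identical in every Provider script of a cluster


def make_loader(dataset, batch_size=4):
    # A dedicated, explicitly seeded generator drives the shuffling order.
    generator = torch.Generator()
    generator.manual_seed(SEED)
    return DataLoader(dataset, batch_size=batch_size, shuffle=True, generator=generator)


# Toy dataset: 12 samples whose label equals their index.
dataset = TensorDataset(torch.arange(12).float().unsqueeze(1), torch.arange(12))

# Each Provider builds its own loader; with the same seed the orders match.
root_order = [labels.tolist() for _, labels in make_loader(dataset)]
leaf_order = [labels.tolist() for _, labels in make_loader(dataset)]
print(root_order == leaf_order)
```

Because each loader gets a fresh generator seeded with the same value, both iterate the shuffled dataset in the same order, which is what keeps inputs at the Root aligned with labels at the Leaf.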

In the template provided above, be sure to pass the labels and test_labels parameters (as DataLoader instances) when initializing Node(), so that training and validation losses are evaluated against the correct labels.

Refer to the API Documentation section for details on the other parameters that are required by Node and Trainer.