Features
GPU Support
Ravnest supports GPU acceleration on Provider nodes. A GPU's parallel processing handles the heavy computations of model training faster than a CPU-only machine, which improves the performance and efficiency of training across distributed environments.
If a Provider has an NVIDIA GPU on their machine, they can enable GPU acceleration by setting the device parameter of the Node() object to torch.device('cuda').
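A Provider script may run on machines with or without a usable GPU, so it can help to pick the device defensively. The following is a minimal sketch, not part of Ravnest itself: pick_device_name() is a hypothetical helper, and the remaining Node() arguments are left out because they depend on your setup.

```python
# Sketch: choose a device string, falling back to CPU when no CUDA GPU is usable.
try:
    import torch
except ImportError:  # PyTorch is not installed in this environment
    torch = None


def pick_device_name() -> str:
    """Return 'cuda' if PyTorch reports a usable NVIDIA GPU, otherwise 'cpu'."""
    if torch is not None and torch.cuda.is_available():
        return "cuda"
    return "cpu"


device_name = pick_device_name()
print(f"Selected device: {device_name}")

# With Ravnest installed, the result would be passed to the Node() object.
# The other Node() arguments are omitted here (assumption: they vary per setup):
# node = ravnest.Node(..., device=torch.device(device_name))
```

Falling back to 'cpu' keeps the same script usable on Providers without an NVIDIA GPU, at the cost of slower computation.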
Custom Trainers
By subclassing the existing Trainer class from Ravnest, you can incorporate specialized logic for handling model training, validation, and metrics specific to your project’s needs. This approach not only streamlines the process of integrating new models but also provides flexibility in adapting to diverse training scenarios and evolving requirements.
import ravnest

class Custom_Trainer(ravnest.Trainer):
    def __init__(self, node=None, train_loader=None, epochs=1):
        super().__init__(node=node, train_loader=train_loader, epochs=epochs)

    # Override the train() method as per your requirements:
    def train(self):
        self.prelim_checks()  # Mandatory function call at start of train() method
        '''
        Training Loop goes here.
        Use self.node.forward_compute() to perform forward pass.
        Use self.node.wait_for_backwards() at end of every epoch to uphold order of respective backward passes.
        '''
        ...
As an example, here is a custom Trainer class for pre-training a BERT model, whose forward pass expects multiple tensors as inputs:
import ravnest

class BERT_Trainer(ravnest.Trainer):
    def __init__(self, node=None, train_loader=None, epochs=1):
        super().__init__(node=node, train_loader=train_loader, epochs=epochs)

    def train(self):
        self.prelim_checks()  # Mandatory function call
        for epoch in range(self.epochs):
            for batch in self.train_loader:
                self.node.forward_compute(tensors=batch['input_ids'],
                                          l_token_type_ids_=batch['token_type_ids'],
                                          l_attention_mask_=batch['attention_mask'])

            self.node.wait_for_backwards()  # To be called at end of every epoch

        print('BERT Training Done!')
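To check the control flow of a custom train() method before running it on a real cluster, one option is a dry run against a stub. The sketch below does not import ravnest: RecordingNode and DryRunTrainer are hypothetical stand-ins that only mirror the method names used above (prelim_checks, forward_compute, wait_for_backwards) and record the order in which they are called.

```python
class RecordingNode:
    """Hypothetical stand-in for a Ravnest node; it only records calls."""

    def __init__(self):
        self.calls = []

    def forward_compute(self, tensors=None, **kwargs):
        self.calls.append("forward")

    def wait_for_backwards(self):
        self.calls.append("wait")


class DryRunTrainer:
    """Mirrors the loop shape of a custom Trainer without using ravnest."""

    def __init__(self, node=None, train_loader=None, epochs=1):
        self.node = node
        self.train_loader = train_loader
        self.epochs = epochs

    def prelim_checks(self):
        # Stub only: the real checks live in ravnest.Trainer.
        self.node.calls.append("prelim_checks")

    def train(self):
        self.prelim_checks()  # First call in train(), as in the examples above
        for _ in range(self.epochs):
            for batch in self.train_loader:
                self.node.forward_compute(tensors=batch)
            self.node.wait_for_backwards()  # Once at the end of every epoch


node = RecordingNode()
DryRunTrainer(node=node, train_loader=[[1, 2], [3, 4]], epochs=2).train()
print(node.calls)
```

The recorded order shows one wait_for_backwards() call per epoch, after all of that epoch's forward passes.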
Data Compression
Ravnest uses advanced data compression techniques to optimize distributed training. These techniques significantly decrease data transmission between provider nodes, reducing network overhead and improving training efficiency. By compressing model parameters and gradients, Ravnest enables faster communication and synchronization among nodes, leading to shorter training times. To activate this capability, providers can set compression=True in the Node() object.
This feature is particularly advantageous in bandwidth-constrained environments, maximizing resource utilization. Moreover, Ravnest’s compression mechanism maintains the integrity and accuracy of the training process, safeguarding model performance.
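The general idea of shrinking a payload before sending it, without altering the values, can be illustrated with Python's standard library. This is only a sketch under stated assumptions: it uses zlib as a stand-in, it is not Ravnest's actual compression scheme (which is not described here), and the mostly-zero gradient list is made-up data chosen because it compresses well.

```python
import struct
import zlib


def pack_floats(values):
    """Serialize a list of floats as little-endian float32 bytes."""
    return struct.pack(f"<{len(values)}f", *values)


def unpack_floats(payload):
    """Inverse of pack_floats: read float32 values back into a list."""
    count = len(payload) // 4  # 4 bytes per float32
    return list(struct.unpack(f"<{count}f", payload))


def compress_payload(values):
    """Pack the values and compress the bytes losslessly with zlib."""
    return zlib.compress(pack_floats(values), 6)


def decompress_payload(blob):
    """Decompress and unpack a payload produced by compress_payload."""
    return unpack_floats(zlib.decompress(blob))


# Hypothetical gradient: mostly zeros, with a few values exactly
# representable in float32 so the round trip can be compared with ==.
gradient = [0.0] * 1000
for i in range(0, 1000, 50):
    gradient[i] = 0.5

raw = pack_floats(gradient)
blob = compress_payload(gradient)
restored = decompress_payload(blob)

print(f"raw bytes: {len(raw)}, compressed bytes: {len(blob)}")
print("round trip intact:", restored == gradient)
```

How much a real payload shrinks depends on the data; dense, high-entropy tensors compress far less than this sparse example.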