API Documentation
ravnest.node
- class ravnest.node.Node(name=None, model=None, optimizer=None, optimizer_params={}, lr_scheduler=None, lr_scheduler_params={}, lr_step_on_epoch_change=True, criterion=None, update_frequency=1, reduce_factor=None, labels=None, test_labels=None, device=torch.device, loss_filename='losses.txt', compression=False, average_optim=False, **kwargs)[source]
Responsible for managing the computational and communication aspects of a distributed machine learning model, including model initialization, parameter synchronization, forward and backward passes, loss computation, and communication between different nodes in the system.
- Parameters:
name (str) – The name of the node. Strictly in the format: ‘node_0’, ‘node_17’ etc.
model (torch.nn.Module) – The PyTorch model associated with the node.
optimizer (torch.optim.Optimizer) – The optimizer used for training the model.
optimizer_params (dict) – Parameters for the optimizer.
lr_scheduler (torch.optim.lr_scheduler) – The learning rate scheduler.
lr_scheduler_params (dict) – Parameters for the learning rate scheduler.
lr_step_on_epoch_change (bool) – Whether to step the learning rate scheduler on epoch change.
criterion (callable) – The loss function.
update_frequency (int) – Frequency of model parameter updates.
reduce_factor (int) – Frequency at which all-reduce will be triggered i.e. trigger all-reduce every time these many updates are done.
labels (torch.utils.data.DataLoader) – Dataloader containing labels.
test_labels (torch.utils.data.DataLoader) – Test labels for validation.
device (torch.device) – The device on which the model will be run (CPU or GPU).
loss_filename (str) – The filename to save loss values.
compression (bool) – Whether to use compression.
kwargs (dict) – Additional arguments.
- check_load_forward_buffer()[source]
Check and process the load forward buffer for incoming data.
Continuously monitors the load forward buffer and processes incoming data for forward pass computations.
- forward_compute(tensors=None, **kwargs)[source]
Initiate a forward computation request.
Adds the forward computation request to the load forward buffer, ensuring synchronization and handling of computational resources.
- Parameters:
tensors (torch.Tensor, optional) – Input tensors for the forward computation, defaults to None
kwargs (dict, optional) – Additional keyword arguments for the computation, defaults to {}
- init_server(load_forward_buffer=None, load_backward_buffer=None, reduce_ring_buffers=None, gather_ring_buffers=None, latest_weights_buffer=None, forward_lock=None, backward_lock=None, reduce_lock=None, gather_lock=None, latest_weights_lock=None, reduce_iteration=None, gather_iteration=None)[source]
Initialize the gRPC server for handling communication with other nodes.
- Parameters:
load_forward_buffer (multiprocessing.Manager.list, optional) – Shared buffer for incoming forward pass data, defaults to None
load_backward_buffer (multiprocessing.Manager.list, optional) – Shared buffer for incoming backward pass data, defaults to None
reduce_ring_buffers (multiprocessing.Manager.dict, optional) – Shared dictionary for reduce operation buffers, defaults to None
gather_ring_buffers (multiprocessing.Manager.dict, optional) – Shared dictionary for gather operation buffers, defaults to None
forward_lock (multiprocessing.Lock, optional) – Lock for synchronizing access to forward buffers, defaults to None
backward_lock (multiprocessing.Lock, optional) – Lock for synchronizing access to backward buffers, defaults to None
reduce_lock (multiprocessing.Lock, optional) – Lock for synchronizing reduce operations, defaults to None
gather_lock (multiprocessing.Lock, optional) – Lock for synchronizing gather operations, defaults to None
reduce_iteration (multiprocessing.Manager.dict, optional) – Shared dictionary for reduce iteration counts, defaults to None
gather_iteration (multiprocessing.Manager.dict, optional) – Shared dictionary for gather iteration counts, defaults to None
- no_grad_forward_compute(tensors=None, output_type=None)[source]
Perform a forward pass without computing gradients.
Executes a forward pass without gradient computation and sends the output to the designated target host and port.
- Parameters:
tensors (torch.Tensor, optional) – Input tensors for the forward pass, defaults to None
output_type (str, optional) – Type of output computation (e.g., validation accuracy), defaults to None
- reset()[source]
Reset the node’s auxiliary and stateful data.
Cleans up temporary directories and files associated with the node, preparing it for a fresh start.
- start()[source]
Start the gRPC server and buffer checking threads.
Spawns a process for serving gRPC requests and starts a thread for checking and processing incoming data buffers.
- start_grpc_server()[source]
Start the gRPC server asynchronously.
Uses asyncio to start the gRPC server in an asynchronous manner.
ravnest.trainer
- class ravnest.trainer.Trainer(node=None, lr_scheduler=None, lr_scheduler_params={}, train_loader=None, val_loader=None, val_freq=1, save=False, epochs=1, batch_size=64, step_size=1, inputs_dtype=None)[source]
A Trainer class for training machine learning models with support for custom learning rate schedulers, training and validation data loaders, and configurable training parameters. This class can be extended to create custom trainers as per your training requirements.
- Parameters:
node (object) – A
ravnest.nodeobject that contains the optimizer for training.lr_scheduler (object, optional) – Learning rate scheduler for the optimizer, defaults to None.
lr_scheduler_params (dict, optional) – Parameters for the learning rate scheduler, defaults to None.
train_loader (DataLoader) – DataLoader for the training dataset.
val_loader (DataLoader, optional) – DataLoader for the validation dataset, defaults to None.
val_freq (int, optional) – Frequency of validation checks during training, defaults to 1.
save (bool, optional) – Whether to save the submodel after training, defaults to False.
epochs (int, optional) – Number of epochs to train the model, defaults to 1.
batch_size (int, optional) – Size of the training batches, defaults to 64.
step_size (int, optional) – Step size for gradient updates, defaults to 1.
inputs_dtype (torch.dtype, optional) – Data type for the input tensors, defaults to None.
n_forwards (int) – Counter for the number of forward passes, initialized to 0.
- evaluate()[source]
Evaluate the trained model on the validation data.
Performs inference on validation data and computes validation accuracy using the trained model.
ravnest.utils
- ravnest.utils.model_fusion(cluster_id=0)[source]
Fuses state dictionaries from TorchScript submodels (.pt files) hosted on all Providers belonging to a cluster. The combined state dictionary is then saved at ‘trained/trained_state_dict.pt’. This final state_dict can then be loaded into the main model using
model.load_state_dict()for obtaining the final trained main model.Make sure to set the
saveparameter ofTrainer()instance toTruefor saving submodels post-training. Only then this method to work.- Parameters:
cluster_id (int, optional) – ID of the cluster whose submodels need to be combined into the main model, defaults to 0
ravnest.operations.utils
- ravnest.operations.utils.clusterize(model=None, example_args=(), example_kwargs={}, pass_data=False)[source]
Takes the complete deep learning model and forms clusters from a pool of compute nodes defined in
node_data/node_configs.jsonfile. Automates the whole process of address sharing across nodes, reduction ring formation and seamlessly stores the results as node metadata json files for each node innode_data/nodes/folder. These metadata files are later used byravnest.node.Nodeclass to load all relevant attributes pertaining to a node.- Parameters:
model – The complete Pytorch Model that needs to be split, defaults to None
example_args – A sample torch tensor that the model expects as input during forward pass, defaults to ()
example_kwargs – Any extra sample inputs that the model expects passed as a dictionary, defaults to {}
pass_data – If set to true, this performs a full forward pass with
example_argtensor to calculate the size of full model. If set to false, it will still calculate full model size using simpler mathematical techniques. Note that disabling this may not work for all models. Defaults to False
- Raises:
ValueError – If the sum of the node RAMs in a cluster does not exceed full model’s size.