API Documentation

ravnest.node

class ravnest.node.Node(name=None, model=None, optimizer=None, optimizer_params={}, lr_scheduler=None, lr_scheduler_params={}, lr_step_on_epoch_change=True, criterion=None, update_frequency=1, reduce_factor=None, labels=None, test_labels=None, device=torch.device, loss_filename='losses.txt', compression=False, average_optim=False, **kwargs)[source]

Responsible for managing the computational and communication aspects of a distributed machine learning model, including model initialization, parameter synchronization, forward and backward passes, loss computation, and communication between different nodes in the system.

Parameters:
  • name (str) – The name of the node. Strictly in the format: ‘node_0’, ‘node_17’ etc.

  • model (torch.nn.Module) – The PyTorch model associated with the node.

  • optimizer (torch.optim.Optimizer) – The optimizer used for training the model.

  • optimizer_params (dict) – Parameters for the optimizer.

  • lr_scheduler (torch.optim.lr_scheduler) – The learning rate scheduler.

  • lr_scheduler_params (dict) – Parameters for the learning rate scheduler.

  • lr_step_on_epoch_change (bool) – Whether to step the learning rate scheduler on epoch change.

  • criterion (callable) – The loss function.

  • update_frequency (int) – Frequency of model parameter updates.

  • reduce_factor (int) – Frequency at which all-reduce will be triggered i.e. trigger all-reduce every time these many updates are done.

  • labels (torch.utils.data.DataLoader) – Dataloader containing labels.

  • test_labels (torch.utils.data.DataLoader) – Test labels for validation.

  • device (torch.device) – The device on which the model will be run (CPU or GPU).

  • loss_filename (str) – The filename to save loss values.

  • compression (bool) – Whether to use compression.

  • kwargs (dict) – Additional arguments.

check_load_forward_buffer()[source]

Check and process the load forward buffer for incoming data.

Continuously monitors the load forward buffer and processes incoming data for forward pass computations.

forward_compute(tensors=None, **kwargs)[source]

Initiate a forward computation request.

Adds the forward computation request to the load forward buffer, ensuring synchronization and handling of computational resources.

Parameters:
  • tensors (torch.Tensor, optional) – Input tensors for the forward computation, defaults to None

  • kwargs (dict, optional) – Additional keyword arguments for the computation, defaults to {}

grpc_server_serve()[source]

Starts the gRPC server and listens for incoming connections.

init_server(load_forward_buffer=None, load_backward_buffer=None, reduce_ring_buffers=None, gather_ring_buffers=None, latest_weights_buffer=None, forward_lock=None, backward_lock=None, reduce_lock=None, gather_lock=None, latest_weights_lock=None, reduce_iteration=None, gather_iteration=None)[source]

Initialize the gRPC server for handling communication with other nodes.

Parameters:
  • load_forward_buffer (multiprocessing.Manager.list, optional) – Shared buffer for incoming forward pass data, defaults to None

  • load_backward_buffer (multiprocessing.Manager.list, optional) – Shared buffer for incoming backward pass data, defaults to None

  • reduce_ring_buffers (multiprocessing.Manager.dict, optional) – Shared dictionary for reduce operation buffers, defaults to None

  • gather_ring_buffers (multiprocessing.Manager.dict, optional) – Shared dictionary for gather operation buffers, defaults to None

  • forward_lock (multiprocessing.Lock, optional) – Lock for synchronizing access to forward buffers, defaults to None

  • backward_lock (multiprocessing.Lock, optional) – Lock for synchronizing access to backward buffers, defaults to None

  • reduce_lock (multiprocessing.Lock, optional) – Lock for synchronizing reduce operations, defaults to None

  • gather_lock (multiprocessing.Lock, optional) – Lock for synchronizing gather operations, defaults to None

  • reduce_iteration (multiprocessing.Manager.dict, optional) – Shared dictionary for reduce iteration counts, defaults to None

  • gather_iteration (multiprocessing.Manager.dict, optional) – Shared dictionary for gather iteration counts, defaults to None

no_grad_forward_compute(tensors=None, output_type=None)[source]

Perform a forward pass without computing gradients.

Executes a forward pass without gradient computation and sends the output to the designated target host and port.

Parameters:
  • tensors (torch.Tensor, optional) – Input tensors for the forward pass, defaults to None

  • output_type (str, optional) – Type of output computation (e.g., validation accuracy), defaults to None

reset()[source]

Reset the node’s auxiliary and stateful data.

Cleans up temporary directories and files associated with the node, preparing it for a fresh start.

start()[source]

Start the gRPC server and buffer checking threads.

Spawns a process for serving gRPC requests and starts a thread for checking and processing incoming data buffers.

start_grpc_server()[source]

Start the gRPC server asynchronously.

Uses asyncio to start the gRPC server in an asynchronous manner.

trigger_save_submodel()[source]

Trigger saving of the current submodel state.

Saves the current state of the model to disk and optionally sends the updated model state to the designated target host and port.

wait_for_backwards()[source]

Wait until all backward passes are completed.

Checks and waits until all initiated backward computations are finished before proceeding with further operations.

ravnest.trainer

class ravnest.trainer.Trainer(node=None, lr_scheduler=None, lr_scheduler_params={}, train_loader=None, val_loader=None, val_freq=1, save=False, epochs=1, batch_size=64, step_size=1, inputs_dtype=None)[source]

A Trainer class for training machine learning models with support for custom learning rate schedulers, training and validation data loaders, and configurable training parameters. This class can be extended to create custom trainers as per your training requirements.

Parameters:
  • node (object) – A ravnest.node object that contains the optimizer for training.

  • lr_scheduler (object, optional) – Learning rate scheduler for the optimizer, defaults to None.

  • lr_scheduler_params (dict, optional) – Parameters for the learning rate scheduler, defaults to None.

  • train_loader (DataLoader) – DataLoader for the training dataset.

  • val_loader (DataLoader, optional) – DataLoader for the validation dataset, defaults to None.

  • val_freq (int, optional) – Frequency of validation checks during training, defaults to 1.

  • save (bool, optional) – Whether to save the submodel after training, defaults to False.

  • epochs (int, optional) – Number of epochs to train the model, defaults to 1.

  • batch_size (int, optional) – Size of the training batches, defaults to 64.

  • step_size (int, optional) – Step size for gradient updates, defaults to 1.

  • inputs_dtype (torch.dtype, optional) – Data type for the input tensors, defaults to None.

  • n_forwards (int) – Counter for the number of forward passes, initialized to 0.

evaluate()[source]

Evaluate the trained model on the validation data.

Performs inference on validation data and computes validation accuracy using the trained model.

pred(data)[source]

Perform prediction on sample test data using the trained model.

Parameters:

data (np.ndarray or torch.Tensor) – Sample data for prediction, can be numpy array or torch tensor

Returns:

Prediction result

Return type:

torch.Tensor

train()[source]

Train the model using the specified training and validation data loaders.

Iterates over the training data for the specified number of epochs, performing forward computations, updating parameters, and optionally evaluating on validation data.

ravnest.utils

ravnest.utils.model_fusion(cluster_id=0)[source]

Fuses state dictionaries from TorchScript submodels (.pt files) hosted on all Providers belonging to a cluster. The combined state dictionary is then saved at ‘trained/trained_state_dict.pt’. This final state_dict can then be loaded into the main model using model.load_state_dict() for obtaining the final trained main model.

Make sure to set the save parameter of Trainer() instance to True for saving submodels post-training. Only then this method to work.

Parameters:

cluster_id (int, optional) – ID of the cluster whose submodels need to be combined into the main model, defaults to 0

ravnest.utils.set_seed(seed=42)[source]

Set the seed for random number generators across torch, numpy and random modules. Handles seed for torch.cuda based on GPU availability.

Parameters:

seed (int, optional) – seed number, defaults to 42

ravnest.operations.utils

ravnest.operations.utils.clusterize(model=None, example_args=(), example_kwargs={}, pass_data=False)[source]

Takes the complete deep learning model and forms clusters from a pool of compute nodes defined in node_data/node_configs.json file. Automates the whole process of address sharing across nodes, reduction ring formation and seamlessly stores the results as node metadata json files for each node in node_data/nodes/ folder. These metadata files are later used by ravnest.node.Node class to load all relevant attributes pertaining to a node.

Parameters:
  • model – The complete Pytorch Model that needs to be split, defaults to None

  • example_args – A sample torch tensor that the model expects as input during forward pass, defaults to ()

  • example_kwargs – Any extra sample inputs that the model expects passed as a dictionary, defaults to {}

  • pass_data – If set to true, this performs a full forward pass with example_arg tensor to calculate the size of full model. If set to false, it will still calculate full model size using simpler mathematical techniques. Note that disabling this may not work for all models. Defaults to False

Raises:

ValueError – If the sum of the node RAMs in a cluster does not exceed full model’s size.