Optimizer#

from streamer.optimizers.streamer_optimizer import StreamerOptimizerArguments, StreamerOptimizer

optim_args = StreamerOptimizerArguments(world_size=8,
                                        alpha=3,
                                        max_layers=3,
                                        optimize_every=100,
                                        average_every=100,
                                        hgn_timescale=True,
                                        hgn_reach=True,
                                        )
optimizer = StreamerOptimizer(optim_args)
class streamer.optimizers.streamer_optimizer.StreamerOptimizerArguments(world_size: int = 1, alpha: int = 3, max_layers: int = 3, optimize_every: int = 100, average_every: int = 1000, hgn_timescale: bool = True, hgn_reach: bool = True)[source]#
world_size: int = 1#

Number of GPUs to distribute the dataset across

alpha: int = 3#

The reach parameter for Hierarchical Gradient Normalization

max_layers: int = 3#

The maximum number of layers to stack

optimize_every: int = 100#

Take a gradient step every this many iterations

average_every: int = 1000#

Average models across GPUs every this many iterations

hgn_timescale: bool = True#

Allow timescale parameter in Hierarchical Gradient Normalization

hgn_reach: bool = True#

Allow reach parameter in Hierarchical Gradient Normalization

class streamer.optimizers.streamer_optimizer.StreamerOptimizer(model, args: StreamerOptimizerArguments)[source]#

The optimizer used with streamer. This class handles optimization, gradient normalization, and model averaging across GPUs.

Parameters:

args (StreamerOptimizerArguments) – The parameters used for the Streamer optimizer

get_param_groups(layer_num)[source]#

Calculates the parameter groups and their weights. For example, if layer_num is 1 and there are 4 layers, the parameter groups will be [[1], [0, 2], [3]]. Their weights depend on alpha, with more weight typically assigned to the earlier groups (e.g., [0.8, 0.15, 0.05]).

Parameters:

layer_num (int) – the index of the layer

Returns:

  • (List(List(int))): List of lists dividing the layers into groups, so that different gradient multipliers can be assigned to them

  • (List(float)): The weights assigned to the parameter groups
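A minimal sketch of how such groups and weights could be derived, assuming layers are grouped by their distance from layer_num and weights decay geometrically with alpha. The function below is hypothetical and not part of the streamer API; the actual weighting scheme may differ.

```python
def sketch_param_groups(layer_num, num_layers, alpha=3):
    """Group layers by absolute distance from layer_num (hypothetical sketch)."""
    max_dist = max(layer_num, num_layers - 1 - layer_num)
    groups = []
    for d in range(max_dist + 1):
        # All layers exactly d steps away from layer_num fall into one group.
        group = [l for l in range(num_layers) if abs(l - layer_num) == d]
        if group:
            groups.append(group)
    # Weights decay geometrically with distance, controlled by alpha, then normalize.
    raw = [alpha ** -d for d in range(len(groups))]
    total = sum(raw)
    weights = [w / total for w in raw]
    return groups, weights
```

With layer_num=1 and 4 layers this reproduces the grouping from the docstring, [[1], [0, 2], [3]], with weights that favor the earlier groups.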

get_gradients()[source]#

Accumulates gradients on all layers’ parameters from all losses in the model

reset()[source]#

Resets the counters of the optimizer

scale_gradients()[source]#

Scales the gradients of all modules by the counters to normalize the gradients

average_models()[source]#

Averages the model parameters across all GPUs every average_every steps
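The averaging step can be illustrated with plain Python lists, treating each replica's parameters as a flat vector. This is only an illustration of the element-wise mean; the real optimizer would use collective communication (e.g., an all-reduce over tensors) across GPUs.

```python
def average_replicas(replica_params):
    """Element-wise average of per-replica parameter vectors (illustration only)."""
    n = len(replica_params)
    # zip(*...) walks the replicas position by position.
    averaged = [sum(vals) / n for vals in zip(*replica_params)]
    # After averaging, every replica holds identical parameters.
    return [list(averaged) for _ in range(n)]
```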

step()[source]#

Calls the optimizer, which calculates the gradients and accumulates them on the parameters. This function does not take a gradient step on every call; an internal counter triggers the actual step every optimize_every calls
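The counter-gated pattern described above can be sketched as follows. The class and attribute names here are hypothetical, chosen only to show the control flow: gradients accumulate on every call, but the underlying optimizer only steps once per optimize_every calls.

```python
class CounterGatedStep:
    """Hypothetical sketch of counter-gated stepping, not the streamer API."""

    def __init__(self, optimize_every=100):
        self.optimize_every = optimize_every
        self.counter = 0
        self.steps_taken = 0

    def step(self):
        self.counter += 1          # gradients would be accumulated here
        if self.counter % self.optimize_every == 0:
            self.steps_taken += 1  # the real optimizer would step and zero grads
```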