Optimizer#

from streamer.optimizers.streamer_optimizer import StreamerOptimizerArguments, StreamerOptimizer

optim_args = StreamerOptimizerArguments(world_size=8,
                                        alpha=3,
                                        max_layers=3,
                                        optimize_every=100,
                                        average_every=100,
                                        hgn_timescale=True,
                                        hgn_reach=True,
                                        )
optimizer = StreamerOptimizer(optim_args)
class streamer.optimizers.streamer_optimizer.StreamerOptimizerArguments(world_size: int = 1, alpha: int = 3, max_layers: int = 3, optimize_every: int = 100, average_every: int = 1000, hgn_timescale: bool = True, hgn_reach: bool = True)[source]#
world_size: int = 1#

Number of GPUs to distribute the dataset across

alpha: int = 3#

The reach parameter for Hierarchical Gradient Normalization

max_layers: int = 3#

The maximum number of layers to stack

optimize_every: int = 100#

Take a gradient step every this many iterations

average_every: int = 1000#

Average models across GPUs every this many iterations

hgn_timescale: bool = True#

Allow timescale parameter in Hierarchical Gradient Normalization

hgn_reach: bool = True#

Allow reach parameter in Hierarchical Gradient Normalization

class streamer.optimizers.streamer_optimizer.StreamerOptimizer(model, args: StreamerOptimizerArguments)[source]#

The optimizer used with streamer. This class handles optimization, gradient normalization, and model averaging across GPUs.

Parameters:

args (StreamerOptimizerArguments) – The parameters used for the Streamer optimizer

get_param_groups(layer_num)[source]#

Calculates the parameter groups and their weights. For example, if layer_num is 1 and there are 4 layers, the parameter groups will be [[1], [0, 2], [3]]. Their weights depend on alpha, with more weight typically assigned to the earlier groups (e.g., [0.8, 0.15, 0.05]).

Parameters:

layer_num (int) – the index of the layer

Returns:

  • (List(List(int))): List of lists dividing the layers into groups, so that different gradient multipliers can be assigned to them

  • (List(float)): The weights assigned to the parameter groups
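A minimal sketch of how such groups and weights could be derived, assuming layers are grouped by their distance from layer_num and weights decay geometrically with alpha. The function below is hypothetical and not part of the streamer API; the actual weighting scheme may differ.

```python
def sketch_param_groups(layer_num, num_layers, alpha=3):
    """Group layers by absolute distance from layer_num (hypothetical sketch)."""
    max_dist = max(layer_num, num_layers - 1 - layer_num)
    groups = []
    for d in range(max_dist + 1):
        # All layers exactly d steps away from layer_num fall into one group.
        group = [l for l in range(num_layers) if abs(l - layer_num) == d]
        if group:
            groups.append(group)
    # Weights decay geometrically with distance, controlled by alpha, then normalize.
    raw = [alpha ** -d for d in range(len(groups))]
    total = sum(raw)
    weights = [w / total for w in raw]
    return groups, weights
```

With layer_num=1 and 4 layers this reproduces the grouping from the docstring, [[1], [0, 2], [3]], with weights that favor the earlier groups.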

get_gradients()[source]#

Accumulates gradients on all layers’ parameters from all losses in the model

reset()[source]#

Resets the counters of the optimizer

scale_gradients()[source]#

Scales the gradients of all modules by the counters to normalize the gradients

average_models()[source]#

Averages the model parameters across all GPUs every average_every steps
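The averaging step can be illustrated with plain Python lists, treating each replica's parameters as a flat vector. This is only an illustration of the element-wise mean; the real optimizer would use collective communication (e.g., an all-reduce over tensors) across GPUs.

```python
def average_replicas(replica_params):
    """Element-wise average of per-replica parameter vectors (illustration only)."""
    n = len(replica_params)
    # zip(*...) walks the replicas position by position.
    averaged = [sum(vals) / n for vals in zip(*replica_params)]
    # After averaging, every replica holds identical parameters.
    return [list(averaged) for _ in range(n)]
```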

step()[source]#

Calls the optimizer, which calculates the gradients and accumulates them on the parameters. This function does not take a gradient step on every call; an internal counter triggers the actual step every optimize_every calls
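The counter-gated pattern described above can be sketched as follows. The class and attribute names here are hypothetical, chosen only to show the control flow: gradients accumulate on every call, but the underlying optimizer only steps once per optimize_every calls.

```python
class CounterGatedStep:
    """Hypothetical sketch of counter-gated stepping, not the streamer API."""

    def __init__(self, optimize_every=100):
        self.optimize_every = optimize_every
        self.counter = 0
        self.steps_taken = 0

    def step(self):
        self.counter += 1          # gradients would be accumulated here
        if self.counter % self.optimize_every == 0:
            self.steps_taken += 1  # the real optimizer would step and zero grads
```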