Optimizer#
from streamer.optimizers.streamer_optimizer import StreamerOptimizerArguments, StreamerOptimizer

optim_args = StreamerOptimizerArguments(world_size=8,
                                        alpha=3,
                                        max_layers=3,
                                        optimize_every=100,
                                        average_every=100,
                                        hgn_timescale=True,
                                        hgn_reach=True,
                                        )
optimizer = StreamerOptimizer(model, optim_args)
- class streamer.optimizers.streamer_optimizer.StreamerOptimizerArguments(world_size: int = 1, alpha: int = 3, max_layers: int = 3, optimize_every: int = 100, average_every: int = 1000, hgn_timescale: bool = True, hgn_reach: bool = True)[source]#
- world_size: int = 1#
Number of GPUs across which the dataset is distributed
- alpha: int = 3#
The reach parameter for Hierarchical Gradient Normalization
- max_layers: int = 3#
The maximum number of layers to stack
- optimize_every: int = 100#
Take a gradient step once every this many calls to step()
- average_every: int = 1000#
Average model parameters across GPUs once every this many iterations
- hgn_timescale: bool = True#
Enable the timescale parameter in Hierarchical Gradient Normalization
- hgn_reach: bool = True#
Enable the reach parameter in Hierarchical Gradient Normalization
- class streamer.optimizers.streamer_optimizer.StreamerOptimizer(model, args: StreamerOptimizerArguments)[source]#
The optimizer used with Streamer. This class handles optimization, gradient normalization, and model averaging across GPUs.
- Parameters:
model – The Streamer model whose parameters will be optimized
args (StreamerOptimizerArguments) – The parameters used for the Streamer optimizer
- get_param_groups(layer_num)[source]#
Calculates the parameter groups and their weights. For example, if layer_num is 1 and there are 4 layers, the parameter groups will be [[1], [0, 2], [3]], grouped by distance from layer 1. The weights depend on alpha and fall off with distance, favoring the closer groups (e.g., [0.8, 0.15, 0.05]); see the sketch after the return values below.
- Parameters:
layer_num (int) – the index of the layer
- Returns:
(List[List[int]]): List of lists dividing the layers into groups to assign different gradient multipliers to them
(List[float]): The weights assigned to the parameter groups
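A minimal sketch of this grouping logic, assuming the group weights are proportional to alpha ** -distance and normalized to sum to 1 (the exact weighting scheme is an assumption; only the grouping pattern is documented above):

def get_param_groups(layer_num, num_layers, alpha=3):
    # Group layers by their distance from layer_num; weight each group
    # by alpha ** -distance (assumed), then normalize the weights.
    groups, weights = [], []
    distance = 0
    while True:
        group = sorted({i for i in (layer_num - distance, layer_num + distance)
                        if 0 <= i < num_layers})
        if not group:
            break
        groups.append(group)
        weights.append(alpha ** -distance)
        distance += 1
    total = sum(weights)
    return groups, [w / total for w in weights]

# get_param_groups(1, 4) -> ([[1], [0, 2], [3]], [0.692, 0.231, 0.077])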
- get_gradients()[source]#
Accumulates gradients on all layers’ parameters from all losses in the model
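A hedged sketch of what this accumulation could look like; model.layer_losses() is a hypothetical accessor for the per-layer losses, not part of the real API:

def get_gradients(model):
    losses = model.layer_losses()  # hypothetical: one predictive loss per layer
    for i, loss in enumerate(losses):
        # Retain the graph for all but the last backward pass so shared
        # activations survive; .grad fields accumulate across the calls.
        loss.backward(retain_graph=(i < len(losses) - 1))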
- scale_gradients()[source]#
Scales the gradients of all modules by their accumulation counters to normalize the gradients
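A hedged sketch of counter-based normalization; the per-module counters (how many times each module accumulated a gradient) are an assumed bookkeeping detail:

def scale_gradients(model, counters):
    for name, module in model.named_children():
        count = max(counters.get(name, 1), 1)  # guard against division by zero
        for p in module.parameters():
            if p.grad is not None:
                p.grad /= count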
- average_models()[source]#
Averages the model parameters across all GPUs once every average_every iterations
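Cross-GPU averaging is typically done with an all-reduce; a minimal sketch using torch.distributed, assuming the process group is already initialized:

import torch.distributed as dist

def average_models(model):
    world_size = dist.get_world_size()
    for p in model.parameters():
        dist.all_reduce(p.data, op=dist.ReduceOp.SUM)  # sum across ranks
        p.data /= world_size                           # then divide to average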
- step()[source]#
Calls the optimizer, which calculates the gradients and accumulates them on the parameters. This function does not perform a gradient step on every call; an internal counter triggers the actual step once every optimize_every calls
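A hedged sketch of the counter-gated step; the attribute names (counter, inner_optimizer) are assumptions about the internals:

def step(self):
    self.get_gradients()                 # accumulate gradients on every call
    self.counter += 1
    if self.counter % self.args.optimize_every == 0:
        self.scale_gradients()           # normalize by the accumulation counters
        self.inner_optimizer.step()      # the wrapped torch optimizer (assumed)
        self.inner_optimizer.zero_grad()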