Flax Optimizers

A collection of optimizers for Flax, some well known and others more arcane. The repository is open to pull requests.

Installation

You can install this library with:

pip install git+https://github.com/nestordemeure/flaxOptimizers.git
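Below is a minimal usage sketch, assuming the optimizers follow the classic flax.optim OptimizerDef interface (create / apply_gradient), that the package is imported as flaxOptimizers, and that Adam takes a learning_rate argument like the official Flax version; check the source for the exact signatures.

import jax
import jax.numpy as jnp
import flaxOptimizers  # import name assumed from the repository name

# Toy parameters and a toy least-squares loss.
params = {'w': jnp.ones((3,)), 'b': jnp.zeros(())}

def loss_fn(params, x, y):
    pred = x @ params['w'] + params['b']
    return jnp.mean((pred - y) ** 2)

# Build an optimizer definition, then wrap the parameters into an optimizer state.
optimizer_def = flaxOptimizers.Adam(learning_rate=1e-3)  # constructor arguments assumed
optimizer = optimizer_def.create(params)

@jax.jit
def train_step(optimizer, x, y):
    grads = jax.grad(loss_fn)(optimizer.target, x, y)
    return optimizer.apply_gradient(grads)

x, y = jnp.ones((8, 3)), jnp.zeros((8,))
for _ in range(10):
    optimizer = train_step(optimizer, x, y)

final_params = optimizer.target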

Optimizers

Classical optimizers, inherited from the official Flax implementation:

  • Adafactor A memory-efficient optimizer that has been used for large-scale training of attention-based models.
  • Adagrad Introduces a denominator to SGD so that each parameter has its own learning rate.
  • Adam The most common stochastic optimizer nowadays.
  • LAMB An improvement on LARS that makes it efficient across task types.
  • LARS An optimizer designed for large batch sizes.
  • Momentum SGD with momentum, optionally Nesterov momentum.
  • RMSProp Developed to solve Adagrad's diminishing learning rate problem.
  • SGD The simplest stochastic gradient descent optimizer possible.

More arcane first-order optimizers:

  • AdamHD Uses hypergradient descent to tune its own learning rate. Good at the beginning of training but tends to underperform at the end.
  • AdamP Corrects premature step-size decay for scale-invariant weights. Useful when a model uses some form of Batch normalization.
  • LapProp Applies exponential smoothing to the update rather than the gradient.
  • MADGRAD Modernisation of the Adagrad family of optimizers, very competitive with Adam.
  • RAdam Uses a rectified variance estimation to compute the learning rate. Makes training smoother, especially in the first iterations.
  • RAdamSimplified A warmup strategy proposed to reproduce RAdam's results with much lower code complexity.
  • Ranger Combines look-ahead, RAdam and gradient centralization to try and maximize performance. Designed with image classification problems in mind.
  • Ranger21 An upgrade of Ranger that combines adaptive gradient clipping, gradient centralization, positive-negative momentum, norm loss, stable weight decay, linear learning rate warm-up, explore-exploit scheduling, lookahead and Adam. It has been designed with transformers in mind.
  • Sadam Introduces an alternative to the epsilon parameter.

Optimizer wrappers:

  • WeightNorm An alternative to batch normalization that performs the weight normalization inside the optimizer, which makes it compatible with more models and faster (official Flax implementation); see the sketch below.
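A short sketch of how a wrapper would be applied, assuming WeightNorm takes the wrapped optimizer definition as its first argument, as in the official flax.optim implementation:

import jax.numpy as jnp
import flaxOptimizers  # import name assumed from the repository name

params = {'w': jnp.ones((3, 4))}
inner_def = flaxOptimizers.Adam(learning_rate=1e-3)   # any optimizer definition from this library (arguments assumed)
optimizer_def = flaxOptimizers.WeightNorm(inner_def)  # weight normalization is applied inside the update (signature assumed)
optimizer = optimizer_def.create(params)              # from here on, train exactly as with an unwrapped optimizer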

Other references

  • AdahessianJax contains my implementation of the Adahessian second-order optimizer in Flax.
  • Flax.optim contains a number of optimizers that do not currently appear in the official documentation. They are all accessible from this library.
