weight_optimizer – Selection of weight optimizers
=================================================

Description
+++++++++++

A weight optimizer is an algorithm that adjusts the synaptic weights in a
network during training to minimize the loss function and thus improve the
network's performance on a given task. Such an optimizer is an essential part
of plasticity rules like e-prop plasticity.

Currently, two weight optimizers are implemented: gradient descent and the
Adam optimizer.

In gradient descent [1]_ the weights are optimized via:

.. math::
    W_t = W_{t-1} - \eta \, g_t \,,

where :math:`\eta` denotes the learning rate and :math:`g_t` the gradient of
the current time step :math:`t`.

In the Adam scheme [2]_ the weights are optimized via:

.. math::
    m_0 &= 0, \quad v_0 = 0, \quad t = 1 \,, \\
    m_t &= \beta_1 \, m_{t-1} + \left( 1 - \beta_1 \right) \, g_t \,, \\
    v_t &= \beta_2 \, v_{t-1} + \left( 1 - \beta_2 \right) \, g_t^2 \,, \\
    \hat{m}_t &= \frac{m_t}{1 - \beta_1^t} \,, \\
    \hat{v}_t &= \frac{v_t}{1 - \beta_2^t} \,, \\
    W_t &= W_{t-1} - \eta \, \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon} \,.

Parameters
++++++++++

The following parameters can be set in the status dictionary.

========== ==== ========================= ======= =================================
**Common optimizer parameters**
-----------------------------------------------------------------------------------
Parameter  Unit Math equivalent           Default Description
========== ==== ========================= ======= =================================
batch_size                                1       Size of batch
eta             :math:`\eta`              1e-4    Learning rate
Wmax       pA   :math:`W_{ji}^\text{max}` 100.0   Maximal value for synaptic weight
Wmin       pA   :math:`W_{ji}^\text{min}` -100.0  Minimal value for synaptic weight
========== ==== ========================= ======= =================================

========= ==== =============== ================ ==============
**Gradient descent parameters (default optimizer)**
--------------------------------------------------------------
Parameter Unit Math equivalent Default          Description
========= ==== =============== ================ ==============
type                           gradient_descent Optimizer type
========= ==== =============== ================ ==============

========= ==== ================ ======= =================================================
**Adam optimizer parameters**
-----------------------------------------------------------------------------------------
Parameter Unit Math equivalent  Default Description
========= ==== ================ ======= =================================================
type                            adam    Optimizer type
beta_1         :math:`\beta_1`  0.9     Exponential decay rate for first moment estimate
beta_2         :math:`\beta_2`  0.999   Exponential decay rate for second moment estimate
epsilon        :math:`\epsilon` 1e-8    Small constant for numerical stability
========= ==== ================ ======= =================================================

The following state variables evolve during simulation.

============== ==== =============== ============= ==========================
**Adam optimizer state variables for individual synapses**
----------------------------------------------------------------------------
State variable Unit Math equivalent Initial value Description
============== ==== =============== ============= ==========================
m                   :math:`m`       0.0           First moment estimate
v                   :math:`v`       0.0           Second moment raw estimate
============== ==== =============== ============= ==========================

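To make the update rules above concrete, the following is a minimal NumPy
sketch of a single optimization step for both schemes. It mirrors the
equations in the Description and the default parameter values from the tables;
the function and variable names are illustrative only and are not part of the
NEST API.

.. code-block:: python

    import numpy as np

    def gradient_descent_step(W, g, eta=1e-4):
        """One gradient descent update: W_t = W_{t-1} - eta * g_t."""
        return W - eta * g

    def adam_step(W, g, m, v, t, eta=1e-4, beta_1=0.9, beta_2=0.999, epsilon=1e-8):
        """One Adam update; m and v are the per-synapse state variables, t >= 1."""
        m = beta_1 * m + (1.0 - beta_1) * g      # first moment estimate
        v = beta_2 * v + (1.0 - beta_2) * g**2   # second moment raw estimate
        m_hat = m / (1.0 - beta_1**t)            # bias-corrected first moment
        v_hat = v / (1.0 - beta_2**t)            # bias-corrected second moment
        W = W - eta * m_hat / (np.sqrt(v_hat) + epsilon)
        return W, m, v

    # A few steps on a toy gradient, keeping the weight within [Wmin, Wmax].
    Wmin, Wmax = -100.0, 100.0
    W, m, v = 50.0, 0.0, 0.0
    for t in range(1, 4):
        g = 0.1 * W                              # placeholder gradient
        W, m, v = adam_step(W, g, m, v, t)
        W = float(np.clip(W, Wmin, Wmax))

The clipping to ``Wmin`` and ``Wmax`` reflects that the synaptic weight is
bounded by the minimal and maximal weight parameters listed above.
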
References
++++++++++

.. [1] Huh D, Sejnowski TJ (2018). Gradient descent for spiking neural
       networks. 32nd Conference on Neural Information Processing Systems.

.. [2] Kingma DP, Ba JL (2015). Adam: A method for stochastic optimization.
       Proceedings of International Conference on Learning Representations
       (ICLR). https://doi.org/10.48550/arXiv.1412.6980

See also
++++++++

:doc:`E-Prop Plasticity `

Examples using this model
+++++++++++++++++++++++++

.. listexamples:: eprop_synapse_bsshslm_2020

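In these examples the optimizer is selected and parameterized through the
synapse model's status dictionary. The following hedged sketch illustrates
this for the Adam optimizer; the nested ``optimizer`` dictionary layout is an
assumption based on the example scripts listed above and should be checked
against them.

.. code-block:: python

    import nest

    # Assumed usage: pass the optimizer settings as a nested "optimizer"
    # dictionary when setting the defaults of the e-prop synapse model.
    nest.SetDefaults(
        "eprop_synapse_bsshslm_2020",
        {
            "optimizer": {
                "type": "adam",   # or "gradient_descent" (the default)
                "batch_size": 1,
                "eta": 1e-4,      # learning rate
                "beta_1": 0.9,
                "beta_2": 0.999,
                "epsilon": 1e-8,
                "Wmin": -100.0,   # pA
                "Wmax": 100.0,    # pA
            }
        },
    )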