The many shades of Gradient Descent (GD).


Definition and Syntax

bool gd(arma::vec& init_out_vals, std::function<double (const arma::vec& vals_inp, arma::vec* grad_out, void* opt_data)> opt_objfn, void* opt_data);
bool gd(arma::vec& init_out_vals, std::function<double (const arma::vec& vals_inp, arma::vec* grad_out, void* opt_data)> opt_objfn, void* opt_data, algo_settings_t& settings);

Function arguments:

  • init_out_vals a column vector of initial values; will be replaced by the solution values.
  • opt_objfn the function to be minimized, taking three arguments:
    • vals_inp a vector of inputs;
    • grad_out a vector to store the gradient; and
    • opt_data additional parameters passed to the function.
  • opt_data additional parameters passed to the function.
  • settings parameters controlling the optimization routine; see below.

Optimization control parameters:

  • double err_tol the tolerance value controlling how small $\| \nabla f \|$ should be before 'convergence' is declared.
  • int iter_max the maximum number of iterations/updates before the algorithm exits.
  • bool vals_bound whether the search space is bounded. If true, then
    • arma::vec lower_bounds this defines the lower bounds.
    • arma::vec upper_bounds this defines the upper bounds.

  • int gd_method which updating formula to use; see the details section below.
  • gd_settings_t gd_settings a structure containing options for the available GD algorithms; see the details section below.

Details

Let $x^{(i)}$ denote the values at stage $i$ of the algorithm, and define $f^{(i)} := f (x^{(i)})$. The Gradient Descent (GD) updating rule is:

$$x^{(i+1)} = x^{(i)} - d^{(i)}$$

where the descent direction $d^{(i)}$ is computed using one of several approaches described below, depending upon the choice of gd_method.

  • gd_method = 0: Basic GD
  • $$d^{(i)} = \alpha \times \nabla f^{(i)}$$ where $\alpha$, the step size (also known the learning rate), is set by step_size.
  • gd_method = 1: GD with Momentum
  • $$d^{(i)} = \mu \times d^{(i-1)} + \alpha \times \nabla f^{(i)}$$ where $\mu$ (the momentum parameter) is set by momentum_par.
  • gd_method = 2: Nesterov accelerated gradient descent (NAG)
  • $$d^{(i)} = \mu \times d^{(i-1)} + \alpha \times \nabla f( x^{(i)} - \mu \times d^{(i-1)})$$
  • gd_method = 3: AdaGrad
  • $$v^{(i)} = v^{(i-1)} + \nabla f^{(i)} \odot \nabla f^{(i)}$$ $$d^{(i)} = \nabla f^{(i)} / ( \sqrt{v^{(i)}} + \epsilon) $$ where $v^{(0)} = \mathbf{0}$; $\odot$ denotes the Hadamard (element-by-element) product; and $\epsilon$ is a small normalization term, set by norm_term.
  • gd_method = 4: RMSProp
  • $$v^{(i)} = \rho \times v^{(i-1)} + (1-\rho) \times \nabla f^{(i)} \odot \nabla f^{(i)}$$ $$d^{(i)} = \nabla f^{(i)} / ( \sqrt{v^{(i)}} + \epsilon) $$ where $\rho$ is set by ada_rho.
  • gd_method = 5: AdaDelta
  • $$v^{(i)} = \rho \times v^{(i-1)} + (1-\rho) \times \nabla f^{(i)} \odot \nabla f^{(i)}$$ $$d^{(i)} = \nabla f^{(i)} \dfrac{\sqrt{m^{(i)}} + \epsilon}{\sqrt{v^{(i)}} + \epsilon} $$ $$m^{(i+1)} = \rho \times m^{(i)} + (1-\rho) \times d^{(i)} \odot d^{(i)}$$ where $m^{(0)} = \alpha$.
  • gd_method = 6: Adam (Adaptive Moment Estimation) and AdaMax
  • $$m^{(i+1)} = \beta_1 \times m^{(i)} + (1-\beta_1) \times \nabla f^{(i)}$$ $$v^{(i)} = \beta_2 \times v^{(i-1)} + (1-\beta_2) \times \nabla f^{(i)} \odot \nabla f^{(i)}$$ $$\hat{m} = \dfrac{m^{(i)}}{1 - \beta_1^i}, \ \ \hat{v} = \dfrac{v^{(i)}}{1 - \beta_2^i}$$ where $m^{(0)} = \mathbf{0}$, and $\beta_1$ and $\beta_2$ are set by adam_beta_1 and adam_beta_2, respectively. Then $$d^{(i)} = \alpha \times \dfrac{\hat{m}}{\sqrt{\hat{v}} + \epsilon}$$
    • If ada_max = true, then the updating rule for $v^{(i)}$ is no longer based on the $L_2$ norm; instead:
    • $$ v^{(i)} = \max \left\{ \beta_2 \times v^{(i-1)}, |\nabla f^{(i)}| \right\} $$ and $$d^{(i)} = \alpha \times \dfrac{\hat{m}}{v^{(i)}+ \epsilon}$$
  • gd_method = 7: Nadam and NadaMax
  • $$m^{(i+1)} = \beta_1 \times m^{(i)} + (1-\beta_1) \times \nabla f^{(i)}$$ $$v^{(i)} = \beta_2 \times v^{(i-1)} + (1-\beta_2) \times \nabla f^{(i)} \odot \nabla f^{(i)}$$ $$\hat{m} = \dfrac{m^{(i)}}{1 - \beta_1^i}, \ \ \hat{v} = \dfrac{v^{(i)}}{1 - \beta_2^i}, \ \ \hat{g} = \dfrac{\nabla f^{(i)}}{1 - \beta_1^i}$$ $$d^{(i)} = \alpha \times \nabla f^{(i)} \odot \dfrac{\beta_1 \hat{m} + (1 - \beta_1) \hat{g}}{\sqrt{v^{(i)}} + \epsilon} $$
    • If ada_max = true, then the updating rule for $v^{(i)}$ is no longer based on the $L_2$ norm; instead:
    • $$ v^{(i)} = \max \left\{ \beta_2 \times v^{(i-1)}, |\nabla f^{(i)}| \right\} $$ and $$d^{(i)} = \alpha \times \nabla f^{(i)} \odot \dfrac{\beta_1 \hat{m} + (1 - \beta_1) \hat{g}}{v^{(i)} + \epsilon} $$

The algorithm stops when $\| \nabla f \|$ is less than err_tol, or the total number of iterations exceeds iter_max.


Examples

As an example, consider maximum likelihood estimation of a logit model, common in statistics and machine learning. In this case we have closed-form expressions for the gradient and hessian. We will compare Adam (Adaptive Moment Estimation) to a pure Newton-based algorithm.

#include "optim.hpp"

// sigmoid function

inline
arma::mat sigm(const arma::mat& X)
{
    return 1.0 / (1.0 + arma::exp(-X));
}

// log-likelihood function data

struct ll_data_t
{
    arma::vec Y;
    arma::mat X;
};

// log-likelihood function with hessian

double ll_fn_whess(const arma::vec& vals_inp, arma::vec* grad_out, arma::mat* hess_out, void* opt_data)
{
    ll_data_t* objfn_data = reinterpret_cast<ll_data_t*>(opt_data);

    arma::vec Y = objfn_data->Y;
    arma::mat X = objfn_data->X;

    arma::vec mu = sigm(X*vals_inp);

    const double norm_term = static_cast<double>(Y.n_elem);

    const double obj_val = - arma::accu( Y%arma::log(mu) + (1.0-Y)%arma::log(1.0-mu) ) / norm_term;

    //

    if (grad_out)
    {
        *grad_out = X.t() * (mu - Y) / norm_term;
    }

    //

    if (hess_out)
    {
        arma::mat S = arma::diagmat( mu%(1.0-mu) );
        *hess_out = X.t() * S * X / norm_term;
    }

    //

    return obj_val;
}

// log-likelihood function for Adam

double ll_fn(const arma::vec& vals_inp, arma::vec* grad_out, void* opt_data)
{
    return ll_fn_whess(vals_inp,grad_out,nullptr,opt_data);
}

//

int main()
{
    int n_dim = 5;     // dimension of parameter vector
    int n_samp = 4000; // sample length

    arma::mat X = arma::randn(n_samp,n_dim);
    arma::vec theta_0 = 1.0 + 3.0*arma::randu(n_dim,1);

    arma::vec mu = sigm(X*theta_0);

    arma::vec Y(n_samp);

    for (int i=0; i < n_samp; i++)
    {
        Y(i) = ( arma::as_scalar(arma::randu(1)) < mu(i) ) ? 1.0 : 0.0;
    }

    // fn data and initial values

    ll_data_t opt_data;
    opt_data.Y = std::move(Y);
    opt_data.X = std::move(X);

    arma::vec x = arma::ones(n_dim,1) + 1.0; // initial values

    // run Adam-based optim

    optim::algo_settings_t settings;

    settings.gd_method = 6;
    settings.gd_settings.step_size = 0.1;
 
    bool success = optim::gd(x,ll_fn,&opt_data,settings);
 
    arma::cout << "\nAdam: true values vs estimates:\n" << arma::join_rows(theta_0,x) << arma::endl;

    //
    // run Newton-based optim

    x = arma::ones(n_dim,1) + 1.0; // initial values
 
    success = optim::newton(x,ll_fn_whess,&opt_data);
 
    arma::cout << "\nnewton: true values vs estimates:\n" << arma::join_rows(theta_0,x) << arma::endl;

    //
 
    return 0;
}
Solution with timings:
Adam: logit_reg test completed successfully.
elapsed time: 0.025128s

Adam: true values vs estimates:
   2.7850   2.6993
   3.6561   3.6798
   2.3379   2.3860
   2.3167   2.4313
   2.2465   2.3064

newton: logit_reg test completed successfully.
elapsed time: 0.255909s

newton: true values vs estimates:
   2.7850   2.6993
   3.6561   3.6798
   2.3379   2.3860
   2.3167   2.4313
   2.2465   2.3064
}