1. ## Advanced optimization methods including SGD and 2nd order?

Since many people here work with optimization, e.g. for neural network training in data compression, I would like to propose a general thread about such methods.
For example, 1st order methods like ADAM currently dominate, while additionally modeling a parabola would be valuable, e.g. for estimating the step size.
However, there are many difficulties with 2nd order methods due to the huge dimension and noisy data, e.g.:
- the Hessian has dim^2 entries, so we need to restrict to a subspace - how do we choose it?
- how do we estimate the Hessian from noisy data, and invert such a noisy Hessian?
- the naive Newton method is attracted to the nearest gradient=0 point ... which is usually a saddle - how do we repair that? Many methods approximate the Hessian as positive definite (e.g. Gauss-Newton, the Fisher matrix in K-FAC, TONGA) - pretending that a very nonconvex function is locally convex ...
Popular overview of 1st order methods: http://ruder.io/optimizing-gradient-descent/
Of 1st and 2nd order methods: https://www.dropbox.com/s/54v8cwqyp7uvddk/SGD.pdf
Any thoughts?

2. ## Thanks:

Shelwien (8th June 2019)

3. I tried using a second order method (the extended Kalman filter) on the last layer of weights in PAQ8. I wrote about it in section 3.2.3 of my master's thesis: http://www.byronknoll.com/thesis.pdf

4. ## Thanks:

Jarek (9th June 2019)

5. Thanks, interesting - so you assume that the parameters of the model evolve with Gaussian transitions - why not just use likelihood instead (as it directly corresponds to compression ratio)?
For example, in each step perform likelihood gradient ascent to adapt the parameters to the local situation.

6. I have had success with various approximate 2nd order methods like CG-Descent, etc. (basically various forms of nonlinear conjugate gradient descent; http://users.clas.ufl.edu/hager/pape...cg_compare.pdf is one such algorithm; the authors also have a decent survey of various CG methods); L-BFGS and L-BFGS* are another option. These methods represent the Hessian implicitly (and assume it is p.d.). I have also used SR1 (where the Hessian is not necessarily p.d., but is symmetric), though only in small dimensions, so I am not sure it would translate well. CG-Descent has worked for me in dimensions on the order of 200K-500K parameters and was very good at minimizing the number of function evaluations.

(Edit: just to clarify, SR1 is assumed to converge faster to the true Hessian than CG or L-BFGS - that's its only relevance here.)

7. ## Thanks:

Jarek (10th June 2019)

8. Originally Posted by Jarek Thanks, interesting - so you assume that the parameters of the model evolve with Gaussian transitions - why not just use likelihood instead (as it directly corresponds to compression ratio)?
For example, in each step perform likelihood gradient ascent to adapt the parameters to the local situation.
I'm not sure I can answer your question - my master's supervisor (Nando de Freitas) derived the equations and I am not very familiar with the EKF. From what I understand: the loss function *is* cross entropy, which directly corresponds to compression ratio. I think the Gaussian transition function is separate from the loss function.

9. I think there is a connection between online optimization and recursive filters (EKF, unscented filter, etc.) - http://probabilistic-numerics.org/as...Bes_Miguez.pdf ; unfortunately, digging through control theory textbooks is for me like having a root canal, so I cannot offer a better explanation.

(Edit: I just realized that one of the experiments in one of the first UKF papers is actually a neural network: https://www.seas.harvard.edu/courses.../unscented.pdf ; so using the EKF and UKF for training neural networks is definitely an established idea.)

10. I have no experience with Kalman filters - will have to take a closer look.

Stefan, both quasi-Newton methods (L-BFGS, SR1, DFP ...) and CG assume convexity - they are attracted to saddles, of which there are often exp(dim) many in machine learning - isn't that a problem?
CG seems unpopular in the machine learning literature (?), but L-BFGS is quite common.
However, beside the saddle attraction problem, it tries to estimate the inverse Hessian from the ~10 most recent stochastic gradients, which are very noisy - that seems numerically very unstable, which is also suggested by its requirement of a line search (as with CG).

Here are some of my thoughts on how to get successful 2nd order methods ( https://arxiv.org/pdf/1901.11457 ) - I would gladly discuss them:
- perform 2nd order modelling only in a low-dimensional (d << D) subspace - chosen e.g. by online PCA of recent gradients (suggesting where the action locally is). Additionally, we can still use the remaining (D-d) directions of the gradients to simultaneously perform e.g. gradient descent for free there - combining methods of multiple orders,
- control the signs of the curvatures to repel from saddles instead of being attracted to them, which gives the significant improvements shown by saddle-free Newton: https://arxiv.org/pdf/1406.2572
- the Hessian should be estimated online for better local dependence, preferably from the gradient sequence alone. This can be done e.g. by linear regression of the gradients (as the Hessian is their linear trend), by updating four exponential moving averages: of g, x, gx, xx. Sure, quasi-Newton can also estimate it from gradients, but only from a few - making it numerically unstable. Exponential moving averages and linear regression should be much better at extracting statistical trends from noisy gradients.
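The subspace split in the first point can be illustrated with a minimal pure-Python sketch (`project_and_split` is a hypothetical helper, not from the paper): project the gradient onto a small orthonormal basis for the 2nd order model, and keep the residual for plain 1st order descent:

```python
def project_and_split(g, basis):
    """Split gradient g into coefficients in the span of `basis`
    (a list of orthonormal vectors) plus the leftover residual.
    The d coefficients can feed a 2nd order model; the residual
    can still be used for 1st order descent 'for free'."""
    coeffs = [sum(gi * vi for gi, vi in zip(g, v)) for v in basis]
    in_span = [sum(c * v[i] for c, v in zip(coeffs, basis))
               for i in range(len(g))]
    residual = [gi - si for gi, si in zip(g, in_span)]
    return coeffs, residual

# toy example: D=3, d=1 subspace along the first axis
basis = [[1.0, 0.0, 0.0]]
coeffs, residual = project_and_split([2.0, 3.0, 4.0], basis)
```

In the real method the basis would come from online PCA of recent gradients; here it is fixed just to show the decomposition.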

11. Yes, in a stochastic setting I have never seen anybody use CG. I personally think a good CG is on par with L-BFGS when using just a few vectors. I have used CG on non-convex problems, with the understanding that the local minimum you end up in is the one you get - I don't think any method can give you convergence guarantees on non-convex problems in general (except maybe in an "expected" sense). I think one of the papers I linked to basically links EKF methods with quasi-Newton; the UKF paper shows two ways of putting the network training problem into a state-space setting.

12. Originally Posted by Jarek - the Hessian should be estimated online for better local dependence, preferably from the gradient sequence alone. This can be done e.g. by linear regression of the gradients (as the Hessian is their linear trend), by updating four exponential moving averages: of g, x, gx, xx. Sure, quasi-Newton can also estimate it from gradients, but only from a few - making it numerically unstable. Exponential moving averages and linear regression should be much better at extracting statistical trends from noisy gradients.
I am not quite sure whether you're proposing this in the limited (d << D) case, where estimating the full Hessian is possible? In D dimensions, you can only afford an implicit Hessian (like CG, UKF), a low-rank Hessian or inverse Hessian (limited-memory quasi-Newton), or a structured Hessian (if diagonal counts as "structure"; maybe for certain networks I can imagine a reasonable sparse Hessian). It seems that if you've already projected onto a d-dimensional space, a lot of the noise will also be projected out, so maybe you don't need additional strategies to smooth the Hessian further.

I could find several references on "stochastic" quasi-Newton, but I think none of them map directly to what you're trying to do - each was a specialized solution. The success of SGD and the importance of neural networks have not escaped notice in numerical optimization circles, but sometimes the language is hard to translate between fields.

In your opinion, what would be the benefit of "better" optimization methods - better generalization performance, or just better computational performance? For pure machine learning the former is ultimately more important than the latter, and I think the fact that more advanced optimization can sometimes lead to worse generalization has been a very frustrating realization for the deep network folks...

13. Generally, the landscape in ML problems is extremely complex. A standard "stop and estimate the Hessian" approach requires maintaining a single Hessian model for a longer time.
In contrast, inexpensive online estimation, especially based only on the stochastic gradient sequence, would allow better dependence on the local situation.

In how many dimensions would a 2nd order model be most practical?
- in a single direction (d=1): we would get a momentum-like method with additional online modelling of a parabola in its single direction - allowing a smarter step size, e.g. half the distance to the minimum of the modeled parabola,
- a complete Hessian (d=D) is completely impractical for ML dimensions in the millions. Additionally, most of its eigenvalues are close to zero due to over-parametrization, making the problem of Hessian inversion huge,

- I think d~10 is a reasonable compromise: it adds a multidimensional model and simultaneous optimization to d=1 at a comparable computational cost.
Choosing these directions by some online PCA of recent gradients, they point where the action really is: the sparse directions with curvature far from zero.
So we could update (e.g. by online PCA) a nearly orthogonal basis (v_i) of size d~10, use projections of the gradient on this basis to update a model of the Hessian there, and use what is left of this gradient (its (D-d)-dimensional projection) for simultaneous 1st order optimization like gradient descent or something more sophisticated.

For (online) Hessian estimation, I think linear regression of gradients seems promising.
In 1D, a parabola has gradient (derivative) g = lambda (x - p); we can get lambda from linear regression - minimizing the least squares error for the (noisy!) gradients -
using four averages, of [g], [x], [gx] and [xx]: lambda = ([gx] - [g][x]) / ([xx] - [x]^2).
For the non-parabola case, we can replace the averages with exponential moving averages - weakening the weights of old noisy gradients.
In multiple dimensions we analogously update 4 exponential moving averages, but [g] (as in momentum) and [x] are vectors, while [gx] and [xx] are matrices.
The linear-regression Hessian estimate is then obtained from such covariance matrices:
H = ([gx] - [g][x]^T) ([xx] - [x][x]^T)^-1
Have you seen something like this in the literature?
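A minimal 1D sketch of this estimator (my reading of the four-average recipe; `ema_curvature` is a hypothetical name), checked on exact parabola data, where the regression must recover lambda:

```python
def ema_curvature(xs, gs, beta=0.9):
    """Estimate the curvature lambda of a (noisy) 1D parabola whose
    gradient is g = lambda * (x - p), via exponentially weighted
    linear regression: lambda = cov(x, g) / var(x).
    Updates four exponential moving averages: [g], [x], [gx], [xx]."""
    mg = mx = mgx = mxx = w = 0.0
    for x, g in zip(xs, gs):
        w = beta * w + (1 - beta)      # normalizer (bias correction)
        mg = beta * mg + (1 - beta) * g
        mx = beta * mx + (1 - beta) * x
        mgx = beta * mgx + (1 - beta) * g * x
        mxx = beta * mxx + (1 - beta) * x * x
    mg, mx, mgx, mxx = mg / w, mx / w, mgx / w, mxx / w
    return (mgx - mg * mx) / (mxx - mx * mx)

# exact parabola with lambda = 3, p = 1: g = 3 * (x - 1)
xs = [0.0, 0.5, 1.5, 2.0]
gs = [3.0 * (x - 1.0) for x in xs]
lam = ema_curvature(xs, gs)
```

With noisy gradients the same code returns the exponentially weighted least squares slope; the multidimensional version replaces the scalars with vectors and matrices as described above.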

Being "better" mainly means faster training, which is extremely important especially for deep NNs ... but also not being attracted to saddles and not stagnating on suboptimal plateaus - e.g. ADAM is known to often lead to suboptimal solutions, and the saddle-free Newton paper shows that other methods lead to "minima" with many strongly negative eigenvalues ... Generalization is a separate, though related, problem.

14. I always wondered if it's possible to use statistics to improve optimizer convergence.
I mean, quite frequently we have to find the optimal parameters of the same function, just with different input,
so it should be possible to observe the optimization process and improve it based on stats?

15. Shelwien, a five-part answer: 1) SGD requires statistics extraction, as we assume the gradients are noisy.

2) This "optimal parameters of the same function, just with different input" sounds like context-dependent parametric models - neighboring thread: https://encode.su/threads/3124-LOCO-...ny-alternatves

3) Regarding "observe the optimization process and improve it based on stats": often the optimum found for one problem is used as the initial point when optimizing another problem - for example in computer vision, to train for a slightly different type of images.

4) However, exploiting an old optimization path for a new problem seems very problematic - the minimum of the loss function is often not a single point, but huge, due to:
- symmetry: permuting parameters often gives an analogous solution,
- the over-parametrization used in deep learning (# parameters >> data size, with regularization to focus on somewhat smooth/reasonable models) means that minima have mostly zero eigenvalues - making the minimum practically a submanifold of nearly full dimension - it seems tough to extract useful statistical trends from old paths.

5) However, there is a heuristic that the initial weights should be chosen randomly - according to the statistics of the given type of NN ...

16. > often optimum found for one problem is used as the initial point of
> optimizing for another problem.

Yes, but that's too simple.
I'm talking about something like observing that the optimal values for some parameter
are always in a specific range... just with statistics.

But tracking the optimal parameter values is not the only option -
usually the optimization process itself would also have some constants/parameters.

> there is a heuristic that the initial weights should be chosen randomly,
> according to the statistics of the given type of NN ...

I try it sometimes - these days I'm always running parameter optimization
for some entropy models, so there are many options for experimenting.

My current conclusion on random init is that it never works for me -
the optimizer always gets stuck at much worse results than what it finds
when starting from previous profiles, which evolved during model development.

Of course, my optimizer is very far from gradient descent -
it's more like a GA, probably.

Actually, I did try to convert the entropy model to a function,
but it mostly failed, as it wasn't precise enough, or fast enough.

17. You could probably maintain the minimum under a continuous transformation of a parameter - e.g. just interlace tiny parameter steps with gradient descent steps (or use some implicit function optimization) ...
Optimization often has many (hyper-)parameters of its own, starting with the step size ... but ideally they should be adapted along the optimization process ... maybe such an optimization schedule could be translated between problems ...

Regarding random init, it is said that it's good to recreate the original statistics - the probability distributions of individual weights, maybe joint ones (I have a tool for joint distributions: https://www.dropbox.com/s/7u6f2zpreph6j8o/rapid.pdf ) ...
... but a previous result of optimization is probably a better initial point, maybe somehow perturbed.

If by GA you mean gradient ascent (of likelihood), it is practically the same - just switch the sign.

Regarding such parametric models, I also started with just polynomials and it was disappointing ...
but then I added optimized powers, e.g. |C-B|^0.8, |A-0.5|^4, |C-2B+D|^0.1, in https://arxiv.org/pdf/1906.03238 ... and the performance improved a lot.
Generally, such a power function would be put into a table, which allows further optimizing such tabled functions based on a dataset ...

18. I think ultimately you need to be able to answer why you think second order information would help in a situation where you're operating away from the optimal solution most of the time. It may be over-simplifying things, but ADAM is also successful because of its learning rate control, which is somewhat analogous to line search - making the most out of a descent direction faster (not taking steps that are too small). You do want individual batch updates to move you in the direction of the gradient (of the current batch) and likely orthogonal to previous gradients (to minimize how much you disturb the solution w.r.t. previous batches). And that starts to look like CG.

19. ADAM is momentum + strengthening underrepresented coordinates + maintaining inertia + a bias correction for the initialization of the exponential moving averages ... nothing like line search.
It doesn't try to model the distance to the minimum in the considered direction - which lies in the linear trend of the gradients, and can be cheaply added into consideration - to increase steps on a plateau and decrease them in a steep valley.
Adding an online parabola model in the considered direction goes toward line search ... without its additional cost, if this parabola is extracted from the linear trend of the gradients.
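For contrast, here is a minimal single-parameter sketch of the standard ADAM update under discussion (the usual published formulas; the step-size constants are arbitrary choices of mine). Note that it indeed contains no model of the distance to the minimum, only rescaling of the rate:

```python
import math

def adam_step(theta, g, m, v, t, lr=0.05, b1=0.9, b2=0.999, eps=1e-8):
    """One ADAM update for a single parameter: momentum (m),
    a per-coordinate magnitude estimate (v), and bias correction
    for the zero-initialized moving averages."""
    m = b1 * m + (1 - b1) * g
    v = b2 * v + (1 - b2) * g * g
    m_hat = m / (1 - b1 ** t)
    v_hat = v / (1 - b2 ** t)
    theta -= lr * m_hat / (math.sqrt(v_hat) + eps)
    return theta, m, v

# minimize f(theta) = (theta - 2)^2 from theta = 10
theta, m, v = 10.0, 0.0, 0.0
for t in range(1, 5001):
    g = 2.0 * (theta - 2.0)
    theta, m, v = adam_step(theta, g, m, v, t)
```

The denominator sqrt(v_hat) only normalizes the gradient magnitude per coordinate; it says nothing about where the gradient trend crosses zero.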

20. Originally Posted by Jarek ADAM is momentum + strengthening underrepresented coordinates + maintaining inertia + a bias correction for the initialization of the exponential moving averages ... nothing like line search.
It doesn't try to model the distance to the minimum in the considered direction - which lies in the linear trend of the gradients, and can be cheaply added into consideration - to increase steps on a plateau and decrease them in a steep valley.
Adding an online parabola model in the considered direction goes toward line search ... without its additional cost, if this parabola is extracted from the linear trend of the gradients.
Any learning rate control is "analogous" to line search - it serves a similar purpose; I did not say it is the same. https://arxiv.org/pdf/1902.09843.pdf shows some ways to improve ADAM, and it explicitly talks about correcting ADAM's overzealous learning rate. A line search is just one method of step size control, which may be too expensive or a wash to be worth it (eating a few extra function/gradient evaluations to do the search may be just as expensive as doing a few more iterations). Using momentum is one way of saying that your past update may have been too small (though that is not the full story either).

Anyway, the literature on optimization is too vast, and I am not sure this is the forum to discuss it in. My professional experience with optimization has always led to (ab)using the unique structure of a problem to specialize a general purpose algorithm almost beyond recognition; it may very well turn out that the same is true for training neural networks - different architectures / cost functions may turn out to "prefer" different optimization algorithms, and that's what we call "job security" in the industry.

21. ADAM has a few heuristics, but does not try to evaluate the distance to the minimum, which requires e.g. a 2nd order model: a parabola in the considered direction.
We can get it practically for free from the linear trend of the gradients in the considered direction (e.g. by updating 4 exponential moving averages) - asking where this linear trend crosses 0.
I wasn't able to find such an approach in the literature - have you seen something like this?

22. I am not sure I have seen this; the closest analog I can think of is the Barzilai-Borwein method, which people have already tried with SGD - it is a way to get a step size without a search. I personally have no experience with it. That gets you into non-monotone methods (where individual iterations may make the cost or the norm of the gradient grow).

(Edit: here is a paper on it from very legit researchers: https://arxiv.org/pdf/1605.04131.pdf ; again, I can only recommend it based on the authors - I have not looked into it carefully.)
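The BB1 rule itself is only a few lines - a sketch (hypothetical helper name; s and y are the usual position and gradient differences), with the sanity check that on an exact parabola of curvature 4 it recovers the Newton step 1/4:

```python
def bb_step_size(x_prev, x_curr, g_prev, g_curr):
    """Barzilai-Borwein (BB1) step size: mu = (s.s) / (s.y), where
    s = x_k - x_{k-1} and y = g_k - g_{k-1}. A 2nd-order-flavored
    step size obtained with no line search."""
    s = [a - b for a, b in zip(x_curr, x_prev)]
    y = [a - b for a, b in zip(g_curr, g_prev)]
    return sum(si * si for si in s) / sum(si * yi for si, yi in zip(s, y))

# exact 1D parabola f(x) = 2 x^2, gradient g = 4 x
mu = bb_step_size([0.0], [1.0], [0.0], [4.0])
```

Unlike the exponential-moving-average regression discussed above, BB uses only the single most recent (s, y) pair, which is part of why it behaves non-monotonically on noisy problems.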

23. ## Thanks:

Jarek (11th June 2019)

24. Thanks, interesting - at first look it seems quite far off, but I will take a closer look.
To get the linear trend of gradients from least squares linear regression we need to update 4 (exponential moving) averages (g, x, xg, xx) - can you think of some paper using such averages?
Momentum uses the average of g only, TONGA that of gg^T (the uncentered covariance matrix), but I haven't met another optimizer exploiting this particular linear trend - the xg average ...?

25. Originally Posted by Jarek Thanks, interesting - at first look it seems quite far off, but I will take a closer look.
I realize I am waving my hands in the air quite a lot here - BB is just the only method I know of that picks a step size for gradient descent that is not obviously a line search or a Newton-style method. I have not seen a method that looks at the trend of the gradients to decide on a step size, but my experience is in more traditional (non-stochastic) optimization, where quasi-Newton/CG methods work quite well even if the problem is not quasi-convex, so I have had the luxury of not needing such trickery... Unfortunately, just searching for "adaptive learning rate" is a very fraught exercise - there's way too much published, and a lot of it of unknown quality.

26. I was trying to find something similar in the literature - discussing it and asking people if they have seen anything similar, especially the [xg] average ... but nothing.
So it seems to be a novel approach; the question is whether it is promising.
It gives a cheap online estimation of the Hessian from the gradient sequence only, which is optimal in the MSE sense with exponentially weakening weights on the old gradients - seemingly exactly what is needed; I don't know any approach even close to its possibilities (?)
The closest is quasi-Newton, but an exponential moving average seems a much better way to extract statistics than estimating the Hessian from a few recent noisy gradients.
I know it needs some serious testing, but this is only a general template with lots of details to choose, and I am a lone theoretician - learning Python and testing it is one of my plans for the summer break.
Anyway, I think it is at least worth some deeper investigation (?) - I would gladly discuss it and look for collaboration ...

27. I wanted to test the approach - I learned the basics of Python, then went to TensorFlow ... and got a bit frustrated; I have to focus on something else now and will return in a few weeks.
But I have prepared a more accessible paper this cycle - focused on the simplest, d=1, case: an SGD momentum method with an online parabola model in the momentum direction to choose the step size - finding the linear trend of the gradients in this direction using linear regression, which can be done by updating four (exponential moving) averages: of (theta, g, theta * g, theta^2), with theta, g being the 1D projections of position and gradient: https://arxiv.org/pdf/1907.07063

28. A simple approximation of the above - modeling the minimum as the point where the linear trend of the gradients crosses 0 - gives a very practical and universal-looking adaptive choice of the learning rate 'mu'.

Imagine you would like to minimize in 1D using gradient (g) descent - with steps:
theta <- theta - g * mu
The big question is how to choose the learning rate mu - too small and you get slow convergence, too large and you e.g. jump over valleys.

In a second order method we would like to find the minimum of the parabola, e.g. in a single such step.
For a parabola these (theta, g) points lie on a line, as in the figure above - we can find its linear coefficient by dividing the standard deviations of both - we jump to its minimum if using:

mu = sqrt( variance(theta) / variance(g) )

This gives a very simple and pathology-resistant (repelling from saddles, avoiding too-large steps) adaptive choice of learning rate - it requires updating just four (exponential moving) averages: of theta, theta^2, g, g^2.
For example we can do it cheaply for all coordinates in SGD (page 5 of v2) - getting 2nd order learning rate adaptation independently for each parameter.

Standard methods like RMSprop and ADAM use a sqrt(mean(g^2)) denominator, but that will not minimize a parabola in one step - has anybody seen optimizers using variances like this?
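The rule above fits in a few lines - a minimal sketch (`variance_step_size` is a hypothetical name; the check uses exact parabola data, where the rule should recover mu = 1/lambda):

```python
import math

def variance_step_size(thetas, gs, beta=0.9):
    """mu = sqrt(var(theta) / var(g)), with both variances taken
    from exponential moving averages of theta, theta^2, g, g^2."""
    mt = mt2 = mg = mg2 = w = 0.0
    for th, g in zip(thetas, gs):
        w = beta * w + (1 - beta)          # normalizer (bias correction)
        mt = beta * mt + (1 - beta) * th
        mt2 = beta * mt2 + (1 - beta) * th * th
        mg = beta * mg + (1 - beta) * g
        mg2 = beta * mg2 + (1 - beta) * g * g
    mt, mt2, mg, mg2 = mt / w, mt2 / w, mg / w, mg2 / w
    return math.sqrt((mt2 - mt * mt) / (mg2 - mg * mg))

# exact parabola with curvature lambda = 5: g = 5 * (theta - 1),
# so the rule should recover mu = 1/lambda = 0.2
thetas = [0.0, 0.4, 1.2, 2.0]
gs = [5.0 * (t - 1.0) for t in thetas]
mu = variance_step_size(thetas, gs)
```

Since the ratio of standard deviations gives |lambda| rather than lambda, the step size stays positive near a saddle - which is the repelling behavior claimed above.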
