\subsection{Minimisation}
\label{sec:minimisation}

The optimisation of functions is a large topic by itself and many
implementations of different algorithms exist. They usually require a function
that returns a value, such as the loss, depending on the parameter values
passed as arguments. Since the computation of the loss, usually the
computationally heavy part of the minimisation, is efficiently implemented in
\zfit{} thanks to TF, in practice any minimiser can be wrapped easily.
In \zfit{}, several minimiser algorithms are implemented by wrapping existing
libraries and giving them a common API. An important part of the design is that
the creation of a minimiser object neither executes any minimisation nor ties
itself to a specific loss; it is simply the configuration of the minimiser.
This implies that the minimiser is stateless and, to actually invoke it, the
\pyth{minimize} method needs to be called. The information about the
minimisation procedure and the parameter values are collected in a
\pyth{FitResult} and returned by the minimiser.
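A minimal sketch of this workflow is shown below; the Gaussian model and the
toy data are illustrative placeholders for the model building described in the
previous sections.

\begin{python}
import zfit

# Illustrative model and toy data: a Gaussian shape with two parameters.
obs = zfit.Space("x", limits=(-5, 5))
mu = zfit.Parameter("mu", 0.2)
sigma = zfit.Parameter("sigma", 1.1)
gauss = zfit.pdf.Gauss(mu=mu, sigma=sigma, obs=obs)
data = gauss.sample(n=1000)

nll = zfit.loss.UnbinnedNLL(model=gauss, data=data)

# Creating the minimiser is pure configuration: no loss is attached yet.
minimizer = zfit.minimize.Minuit()

# The minimisation is invoked explicitly; the loss is passed only here.
result = minimizer.minimize(nll)

# The FitResult collects the parameter values and information
# about the minimisation procedure.
print(result.params)
\end{python}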

\subsubsection{Different optimisations}

While a whole variety of algorithms exists, not all are equally well suited
for model fitting as done in \zfit{}. Several distinctions influence the
choice of a particular minimiser.

\begin{description}
	\item[Derivative] Some optimisers use the derivatives and others do not.
	In general, using the derivative provides a significant advantage, since it
	carries information about the local shape of the function. This requires,
	however, that the function to minimise be differentiable, which is not
	always the case. For model fitting, the loss functions are differentiable
	and, thanks to TF, \zfit{} obtains exact gradients through automatic
	differentiation.
	
	\item[Global/Local] Some optimisers aim at finding a global minimum, for
	example by performing a variation of a large grid search. Others focus on
	a local minimum by starting from an initial point and following a path
	towards the nearest minimum. The latter is more accurate and faster
	\textit{if} a good initial parameter estimate is given, so that the
	minimiser does not start far from the true minimum. In model fitting, a
	reasonable estimate can often be made.
	
	\item[Dimensionality] The number of parameters that have to be tuned in
	order to minimise the loss has an impact on the strategy. While the Hessian
	matrix can help greatly with the minimisation, it has $n_{\mathrm{params}}^2$
	entries. This renders its usage impractical for the hundreds of thousands
	of parameters used in deep learning, as the short estimate after this list
	illustrates. For model fitting, with at most a few hundred and typically
	around a dozen parameters, using the Hessian is feasible.
\end{description}
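To put these numbers into perspective, storing the Hessian in double precision
(assuming 8 bytes per entry) requires
\begin{equation}
	n_{\mathrm{params}} = 12 \;\Rightarrow\; 144 \text{ entries}, \qquad
	n_{\mathrm{params}} = 10^{5} \;\Rightarrow\; 10^{10} \text{ entries}
	\approx 80\,\mathrm{GB}.
\end{equation}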

In HEP model fitting, mostly local, second order derivative minimisers are
used. An important algorithm is Newton's method of minimisation: leaving the
mathematical details aside, the algorithm performs a second order Taylor
approximation of the function using the exact Hessian. Assuming that the
target is a stationary point with the derivative equal to zero, the resulting
equation is solved and the algorithm jumps to the estimated minimum. Since
this solution would only be the true minimum if the second order approximation
were exact, the step is repeated until some convergence criteria are fulfilled.
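Denoting the loss by $L$ and the parameter vector by $\theta$, a single Newton
step updates the parameters as
\begin{equation}
	\theta_{k+1} = \theta_k - H^{-1}(\theta_k)\, \nabla L(\theta_k),
\end{equation}
where $\nabla L$ is the gradient and $H$ the Hessian of the loss evaluated at
the current parameter values.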

Newton's method is often not used directly, since the computation of the
Hessian and its inverse can be computationally expensive. Instead, an
approximation of the inverse Hessian is calculated and updated on every
minimisation step, giving rise to a family of methods called quasi-Newton
methods. Since such an update is not uniquely determined in more than one
dimension, different methods impose different additional conditions on it. A
prominent example is the BFGS algorithm and its low-memory variant L-BFGS,
which only stores the most recent steps. A quasi-Newton variable-metric method
is also at the core of Minuit, the most commonly used minimiser in frameworks
like \roofit or \root.
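As an illustration, with the step $s_k = \theta_{k+1} - \theta_k$, the gradient
difference $y_k = \nabla L(\theta_{k+1}) - \nabla L(\theta_k)$ and
$\rho_k = 1 / (y_k^{T} s_k)$, the BFGS update of the inverse Hessian
approximation $B_k \approx H^{-1}$ reads
\begin{equation}
	B_{k+1} = \left(I - \rho_k s_k y_k^{T}\right) B_k
	\left(I - \rho_k y_k s_k^{T}\right) + \rho_k s_k s_k^{T},
\end{equation}
where $I$ denotes the identity matrix. L-BFGS avoids storing $B_k$ explicitly
by keeping only the last few pairs $(s_k, y_k)$ and reconstructing the product
of $B_k$ with a vector on the fly.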

Another important aspect is the stopping criterion. While Minuit comes with
its own stopping criterion, and good criteria in general have to be found
through experience, this tuning is still ongoing work in \zfit{} for the other
minimisers. It is, however, straightforward to define and change the criterion
for a given minimiser.
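As a sketch of how such a criterion can be configured, the following assumes
that the wrapped minimiser exposes a tolerance argument at construction time;
the exact name and default value may differ between minimisers and \zfit{}
versions.

\begin{python}
import zfit

# The tolerance steers the convergence criterion; a smaller value demands a
# more precise minimum (the argument name is assumed and may differ).
minimizer = zfit.minimize.Minuit(tolerance=1e-4)
\end{python}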

A special mention has to be made of TF optimisers. As discussed above, deep
learning relies on first order algorithms based on the so-called ``gradient
descent'' technique: the algorithm simply follows the negative gradient
iteratively. Advanced variations adapt the step size based on heuristics, but
for low dimensional problems they remain largely inferior to quasi-Newton
methods. A wrapper for any TF optimiser is available in \zfit{}; an example is
provided using the implementation of the Adam~\cite{kingma2014adam} optimiser.
Since simple benchmarks yield an order of magnitude increase in the number of
minimisation steps compared to quasi-Newton methods, they are not competitive.
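For reference, plain gradient descent with learning rate $\eta$ updates the
parameters as
\begin{equation}
	\theta_{k+1} = \theta_k - \eta\, \nabla L(\theta_k),
\end{equation}
using only first order information; Adam additionally keeps running averages
of the gradient and of its square to adapt the effective step size per
parameter.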