\subsection{TensorFlow backend}
\label{sec:implementation}

Deep learning has recently gained a lot of attention as it has been quite
successful as a tool in big data analysis and predictive statistics. At its
core, the idea is to extract correlations from data by training a neural
network on them in order to make accurate predictions on unknown samples. This
can be the classification of images, the prediction of stock prices and more.
More interestingly, the typical deep learning workflow can also be summarised
by Fig.~\ref{fig:fit_workflow} by replacing ``model'' with
``neural network''\footnote{Neural networks in deep learning are also called
	models. The term ``model'' will here solely be used for ``classical'' model
	fitting as in \zfit{}.},
``minimisation'' with ``training''
and removing the last block ``Result \& Error''. Moreover, deep learning and HEP
model fitting both use large data samples and build complex models. This
similarity inspired the implementation of \zfit{} with a deep learning
framework as the backend.

While the two workflows look incredibly similar at first glance, there are
some hidden, crucial differences.
Knowing them is essential in order to understand the advantages but
also the limitations of this approach. In the following, we take a simplified
look at the core of deep learning and compare it to model fitting.

\begin{itemize}
	\item A Deep Neural Network (DNN) is simply a function $f:
	\mathbb{R}^n \rightarrow \mathbb{R}^m$. The dimension $n$ of the input
	space is the number of observables of a single event. The output of the
	function, the prediction, is notably $m$-dimensional. Contrary to this, a
	model built from PDFs maps into the one-dimensional space of a normalised
	probability. DNN outputs, if used for classification, correspond to
	pseudo-probabilities: while they are not normalised, there exists a
	monotonic transformation to a probability.

	\textit{Consequence}: normalisation over a certain range is specific to
	model fitting, and no explicit tool, e.g. for numerical integration, exists
	in deep learning frameworks (a minimal sketch of such a normalisation is
	given after this list).
	
	\item
	In model fitting, the composition of shapes is motivated by prior
	knowledge and built on an underlying theory. There is often a meaning
	behind each part, and the shape of the model is restricted to a specific
	problem. This specific shape, coming from assumptions and prior knowledge,
	is what keeps the number of parameters low: typically no more than a dozen
	for simple fits and at most a few hundred for the most complicated fits.
	\begin{figure}[tbp]
		\centering
		\includegraphics[width=0.5\textwidth]{images/neural_network_scheme.png}
		\caption{A schematic view of a DNN. The input layer takes the data and
			has the dimension of an event. Each end point of a line is
			multiplied by a weight, a free parameter, and summed with the
			other lines; at each node, a non-linear function is then applied.}
		\label{fig:dnn}
	\end{figure}
	Contrary to that, the structure of a DNN is basically agnostic to the
	problem and depends mostly on its complexity. It therefore incorporates a
	minimum of prior knowledge and assumptions about the correlations in the
	data. This huge versatility is what makes deep learning such a successful
	field, but it comes at a price: a great number of free parameters is
	needed, starting at thousands for really simple DNNs, typically reaching
	hundreds of thousands and going up to tens of millions. A DNN is a
	structure consisting of layers of nodes, as shown in Fig.~\ref{fig:dnn};
	each node adds an additional parameter for every incoming connection,
	resulting in a large number of parameters.

	\textit{Consequence}: While dozens of DNN-building libraries like
	Keras\footnote{Keras is an API specification only; a reference
	implementation exists.}, PyTorch and others offer great capabilities in
	building DNNs, their abstractions into layers are of no use for model
	fitting.
	
	\item A DNN essentially consists of matrix multiplications scaled by the
	free parameters, with a simple, non-linear activation function applied in
	between. In model fitting, the shape can have an arbitrary complexity and
	contain a whole range of elementary functions. Furthermore, control flow
	elements and complex numbers are often used as well. Not only is the
	function itself more complicated, the dependency between parameters can
	also be highly non-trivial.

	There is an additional difference in the precision of the floating point
	operations. In model fitting, the required precision is higher, because
	the values of likelihoods and their changes are large compared with neural
	networks, whose values often vary between $-1$ and $1$. Also, the
	quasi-Newton methods described below build an approximation of the
	second-order derivatives, which needs a high enough precision.

	
	\textit{Consequence}: While both fields do heavy computations, the focus
	of the optimisations in a fitting library is slightly different and
	requires, for example, always explicitly specifying float64 as the data
	type.
	
	\item Minimising a loss is a non-trivial task.
	Algorithms usually start at a certain point and use local information to
	make forward steps. The gradient and sometimes higher-order derivatives,
	usually up to second order, are used to help find the minimum. Which order
	is usable strongly depends on the number of parameters: the Hessian matrix
	of $n$ parameters has $n^2$ entries, rendering its calculation unfeasible
	for more than a few hundred parameters; this restricts the minimisation of
	DNNs to first-order derivatives only, at the cost of more required steps.

	\textit{Consequence}: On one side, minimisers designed for DNNs and
	optimised to work with the framework are not suitable for model fitting.
	On the other side, the analytic gradients that are provided by the
	frameworks for their minimisers can easily be extended to higher orders
	and come in very handy for model fitting minimisers (see the second sketch
	after this list).
	
	\item Fitting a model has the goal of finding the parameters for which the
	model matches the data best. In terms of a loss function, this is
	equivalent to finding its \textit{global} minimum. Being stuck in a local
	minimum is a problem and requires careful treatment. Contrary to this, in
	DNN training the global minimum is neither found nor desired. The DNN has
	to approximate an arbitrary data sample \textit{well enough}, and a local
	minimum is usually found. It is in fact preferred over the global minimum
	since, due to the high number of degrees of freedom of a DNN, the global
	minimum typically entails a huge over-fit\footnote{Basically remembering
	every bit of noise in the data.} and bad generalisation.

	\textit{Consequence}: While finding the global minimum is crucial in model
	fitting, deep learning is only interested in finding a good local minimum.

	\item
	Finding the global minimum in the model fitting case involves the
	evaluation of the loss over the whole data sample at every step. For the
	training of a DNN, variations of a technique called stochastic gradient
	descent (SGD) are used: they evaluate only a small mini-batch of the data,
	typically around 32 events, and then take a step in the direction of the
	negative gradient.
	
	\textit{Consequence}: Both fields push the limits of handling big data,
	and samples too large for the memory are common. Deep learning is
	optimised to loop through a data set with very small batch sizes, taking a
	minimisation step after each batch, but doing so millions of times. Model
	fitting needs to loop through the whole data sample \textit{once} for a
	single step, but takes no more than a few thousand steps.
\end{itemize}
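
To make the first difference more concrete, the following minimal sketch shows
how a shape can be normalised numerically from TensorFlow primitives alone,
here with a simple Riemann sum, since the framework provides no dedicated
integration tool. The Gaussian shape, the parameter names and the integration
grid are purely illustrative and do not correspond to the actual \zfit{}
implementation; the TensorFlow 1.x graph API and explicit float64 data types
are assumed.

\begin{verbatim}
import numpy as np
import tensorflow as tf  # TensorFlow 1.x graph API assumed

# Illustrative unnormalised Gaussian shape; mu and sigma are free parameters.
mu = tf.Variable(0.2, dtype=tf.float64)
sigma = tf.Variable(1.1, dtype=tf.float64)

def shape(x):
    return tf.exp(-0.5 * ((x - mu) / sigma) ** 2)

# Numerical normalisation over the range [-5, 5] with a simple Riemann sum;
# the deep learning framework itself ships no integration tool.
lower, upper, n_points = -5.0, 5.0, 10000
grid = tf.linspace(tf.constant(lower, dtype=tf.float64),
                   tf.constant(upper, dtype=tf.float64), n_points)
integral = tf.reduce_mean(shape(grid)) * (upper - lower)

data = tf.constant(np.random.normal(size=1000), dtype=tf.float64)
pdf = shape(data) / integral       # normalised probability density
nll = -tf.reduce_sum(tf.log(pdf))  # negative log-likelihood loss

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    print(sess.run(nll))
\end{verbatim}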
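The analytic derivatives mentioned above can be obtained directly from the same
graph machinery. The sketch below, again assuming the TensorFlow 1.x graph API
and using a toy loss that merely stands in for a negative log-likelihood,
extends the first-order gradient to the full Hessian with \texttt{tf.gradients}
and \texttt{tf.hessians}; for the small number of parameters typical of model
fitting, this matrix remains cheap to compute.

\begin{verbatim}
import tensorflow as tf  # TensorFlow 1.x graph API assumed

# Toy loss in two parameters, standing in for a negative log-likelihood.
params = tf.Variable([1.5, -0.7], dtype=tf.float64)
a, b = params[0], params[1]
loss = (a - 2.0) ** 2 + (a * b - 1.0) ** 2

gradient = tf.gradients(loss, params)[0]  # first derivatives, shape (2,)
hessian = tf.hessians(loss, params)[0]    # 2x2 matrix of second derivatives

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    grad_value, hess_value = sess.run([gradient, hessian])
\end{verbatim}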
 
As seen, while there are differences, the core of the problem is still the
same: build a complicated model, use the data to evaluate it and tune the
parameters to optimise the loss.

To accomplish this task efficiently, deep learning frameworks use a declarative
paradigm by building a computation graph, as shown in Fig.
\ref{fig:graph_example}.
This allows optimisations to be performed and the parallelisation to be defined
\textit{before} the actual execution of the computation. Furthermore, it also
allows obtaining an analytic expression for the gradient by consecutively
applying the chain rule within the graph.

\begin{figure}[tbp]
	\centering
	\includegraphics[width=0.35\textwidth]{figs/graph_example.png}
	\caption{Example of a graph representing
	$\mathrm{result} = (5 + 3) \cdot (7 + 2)$.}
	\label{fig:graph_example}
\end{figure}
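
As an illustration of this declarative style, the small example below builds
the graph of Fig.~\ref{fig:graph_example} (assuming the TensorFlow 1.x API):
the Python code only declares the operations, and the computation is triggered
by running the graph in a session, which can be repeated arbitrarily often on
the same, immutable graph.

\begin{verbatim}
import tensorflow as tf  # TensorFlow 1.x graph API assumed

# Building the graph only declares the computation; nothing is evaluated yet.
a = tf.constant(5.0, dtype=tf.float64)
b = tf.constant(3.0, dtype=tf.float64)
c = tf.constant(7.0, dtype=tf.float64)
d = tf.constant(2.0, dtype=tf.float64)
result = (a + b) * (c + d)

# Only executing the graph in a session performs the computation.
with tf.Session() as sess:
    print(sess.run(result))  # 72.0
\end{verbatim}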


To implement model fitting in \zfit{}, the TensorFlow library was chosen,
restricting the implementation to static graphs, as explained in more detail in
Appendix \ref{appendix:tensorflow}. The main motivation for this decision comes
from the fact that models in model fitting \textit{usually} do not change their
logic but are rather built once and then minimised. While the same is true for
most DNNs, more advanced fields in deep learning like reinforcement learning
can rely heavily on dynamic models. The main advantages of a static graph are
the additional, potential speedup and the immutability, which leads to less
unexpected behaviour. The restriction that \textit{anything built will remain
like that and not change} allows for more efficient optimisations compared to
a graph where any part could change at any time and a re-analysis of the graph
would be required.

Working with graphs leads to difficulties and unexpected behaviours in
comparison with more traditional, non-graph-based code, such as the one used
in \roofit{}.
In \zfit{}, most of these complications are hidden from the user, and the
library offers a user experience similar to the one found in other model
fitting libraries. This requires some extra care behind the scenes, as
explained in Appendix \ref{appendix:tensorflow}.