\subsection{TensorFlow backend} \label{sec:implementation}
Deep learning has recently gained a lot of attention as it has been quite successful as a tool in big data analysis and predictive statistics. At its core, the idea is to extract correlations from data by training a neural network on them in order to make accurate predictions on unknown samples. This can be the classification of images, the prediction of stock prices, etc. More interestingly, the typical deep learning workflow can also be summarised by Fig. \ref{fig:fit_workflow} by replacing ``model''\footnote{Neural networks in deep learning are also called models. The term ``model'' will here solely be used for ``classical'' model fitting as in \zfit{}.} with ``neural network'', ``minimisation'' with ``training'' and removing the last block ``Result \& Error''. Moreover, deep learning and HEP model fitting both use large data samples and build complex models. This similarity inspired the implementation of \zfit{} with a deep learning framework as the backend.
While the two workflows look remarkably similar at first glance, there are some hidden, crucial differences. Knowing them is essential in order to understand the advantages but also the limitations of this approach. In the following, we take a simplified look at the core of deep learning and compare it to model fitting.
\begin{itemize}
\item A Deep Neural Network (DNN) is simply a function $f: \mathbb{R}^n \rightarrow \mathbb{R}^m$. The dimension $n$ of the input space is the number of observables of a single event. The output of the function, the prediction, is notably $m$-dimensional. Contrary to this, a model built from PDFs maps into the one-dimensional space of a normalised probability. DNN outputs, if used for classification, correspond to pseudo-probabilities: while they are not normalised, there exists a monotonic transformation to a probability. \textit{Consequence}: normalisation over a certain range is specific to model fitting and no explicit tool, e.g. for numerical integration, exists in deep learning frameworks; a sketch of such a normalisation built from elementary operations is shown after this list.
\item In model fitting, the composition of shapes is motivated by prior knowledge and built on an underlying theory. There is often a meaning behind each part and the shape of the model is restricted to a specific problem. This specific shape, coming from assumptions and previous knowledge, is what keeps the number of parameters low: typically, no more than a dozen for simple fits and at most a few hundred for the most complicated fits.
\begin{figure}[tbp]
\centering
\includegraphics[width=0.5\textwidth]{images/neural_network_scheme.png}
\caption{A schematic view of a DNN function. The input takes the data and has the dimension of an event. Each end-point of a line is multiplied by a weight, a free parameter, and summed with the other lines. At a node, a non-linear function is then applied.}
\label{fig:dnn}
\end{figure}
Contrary to that, the structure of a DNN is basically agnostic to the problem and depends mostly on its complexity. It therefore incorporates a minimum of prior knowledge and assumptions about the correlations in the data. This huge versatility is what makes deep learning such a successful field, but it comes at a price: a large number of free parameters is needed, starting at thousands for really simple DNNs, typically being around hundreds of thousands and going up to tens of millions. DNNs are a structure consisting of layers with nodes, as shown in Fig. \ref{fig:dnn}.
Each of the nodes adds an additional parameter for every incoming connection, resulting in a large number of parameters. \textit{Consequence}: While dozens of DNN-building libraries like Keras\footnote{Keras is an API specification only; a reference implementation exists.}, PyTorch and more offer great capabilities for building DNNs, their abstractions into layers are of no use for model fitting.
\item A DNN is essentially a series of matrix multiplications, whose entries are the free parameters (the weights), with a simple non-linear activation function applied in between. In model fitting, the shape can have an arbitrary complexity and contain a whole range of elementary functions. Furthermore, control flow elements and complex numbers are often used as well. Not only is the function itself more complicated, but the dependency between parameters can also be highly non-trivial. There is an additional difference in the precision of the floating point operations. In model fitting, the required precision is higher, because the values of the likelihoods and their changes are larger than in neural networks, whose values often vary between $-1$ and $1$. Also, the quasi-Newton methods described below build an approximation of the second-order derivatives, which needs a sufficiently high precision. \textit{Consequence}: While both fields do heavy computations, the focus of the optimisations in a fitting library is slightly different and requires, for example, to always explicitly specify float64 as the data type.
\item Minimising a loss is a non-trivial task. Algorithms usually start at a certain point and use local information to make forward steps. The gradient and sometimes higher-order derivatives, usually up to the second order, are used to help find the minimum. In particular, which order is usable strongly depends on the number of parameters: the Hessian matrix of $n$ parameters has $n^2$ entries, rendering its calculation unfeasible for more than a few hundred parameters; this restricts the minimisation of DNNs to first-order derivatives, at the cost of more required steps. \textit{Consequence}: On one side, minimisers designed for DNNs and optimised to work with the framework are not suitable for model fitting. On the other side, the analytic gradients that are provided by the frameworks for their minimisers can easily be extended to higher orders and come in very handy for model fitting minimisers.
\item Fitting a model has the goal of finding the parameters for which the model matches the data best. In terms of a loss function, this is equivalent to finding its \textit{global} minimum. Being stuck in a local minimum is a problem and requires careful treatment. Contrary to this, in DNN training the global minimum is neither found nor desired. The DNN has to approximate an arbitrary data sample \textit{well enough}; a local minimum is usually found and is in fact preferred over the global minimum, since, due to the high number of degrees of freedom of a DNN, the latter typically entails a huge over-fit\footnote{Essentially remembering every bit of noise in the data.} and a bad generalisation. \textit{Consequence}: While finding the global minimum is crucial in model fitting, deep learning is only interested in finding a good local minimum.
\item Finding the global minimum in the model fitting case involves the evaluation of the loss over the whole data sample at every step. For the training of a DNN, variations of a technique called stochastic gradient descent (SGD) are used: they work by evaluating only a small mini-batch of the data, typically around 32 events, and then taking a step in the direction of the negative gradient. \textit{Consequence}: Both fields are pushing the limits of handling big data, and samples too large for the memory are common. Deep learning is optimised to loop through a data set with very small batch sizes to take a minimisation step, but to do that millions of times. Model fitting needs to loop through the whole data sample \textit{once} for a single step, but no more than a few thousand times.
\end{itemize}
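As an illustration of the first consequence above, the following minimal sketch shows how a normalisation integral over a fit range can be assembled from elementary TensorFlow operations, here with a simple trapezoidal rule. The function names, the parameter values and the choice of a Gaussian shape are purely illustrative and not part of the \zfit{} API; in the static-graph approach described below, the resulting tensor is only evaluated when the graph is run.
\begin{verbatim}
import tensorflow as tf

# Illustrative unnormalised Gaussian shape; mu and sigma play the
# role of the free parameters of the model.
def gauss_unnormalised(x, mu, sigma):
    return tf.exp(-0.5 * tf.square((x - mu) / sigma))

# Normalisation over the fit range, built from elementary operations
# (trapezoidal rule); note the explicit float64 precision.
def normalisation(mu, sigma, lower, upper, n_points=10001):
    x = tf.linspace(tf.constant(lower, tf.float64),
                    tf.constant(upper, tf.float64), n_points)
    y = gauss_unnormalised(x, mu, sigma)
    dx = (x[-1] - x[0]) / tf.cast(n_points - 1, tf.float64)
    return dx * (tf.reduce_sum(y) - 0.5 * (y[0] + y[-1]))

# `norm` is a graph node in the static-graph approach (evaluated by
# running the graph) or a concrete value with eager execution.
norm = normalisation(mu=tf.constant(0.2, tf.float64),
                     sigma=tf.constant(1.1, tf.float64),
                     lower=-5.0, upper=5.0)
\end{verbatim}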
As seen, while there are differences, the core of the problem is still the same: build a complicated model, use data to obtain a value from it and tune the parameters to optimise the loss. To accomplish this task efficiently, deep learning frameworks use a declarative paradigm by building a computation graph as seen in Fig. \ref{fig:graph_example}. This allows optimisations to be performed and the parallelisation to be defined \textit{before} the actual execution of the computation. Furthermore, it also allows an analytic expression for the gradient to be obtained by consecutively applying the chain rule within the graph, as sketched at the end of this section.
\begin{figure}[tbp]
\centering
\includegraphics[width=0.35\textwidth]{figs/graph_example.png}
\caption{Example of a graph representing $result = (5 + 3) * (7 + 2)$.}
\label{fig:graph_example}
\end{figure}
To implement model fitting in \zfit{}, the TensorFlow library was chosen, restricting the implementation to static graphs as explained in more detail in Appendix \ref{appendix:tensorflow}. The main motivation for this decision comes from the fact that models in model fitting \textit{usually} do not change their logic but are rather built once and then minimised. While the same is true for most DNNs, more advanced fields in deep learning like reinforcement learning can rely heavily on dynamic models. The main advantages of a static graph are the additional, potential speedup and the immutability, which leads to less unexpected behaviour. The restriction that \textit{anything built will remain like that and not change} allows for clearly more efficient optimisations compared to a graph where any part could change at any time and a re-analysis of the graph would be required.
Working with graphs leads to difficulties and unexpected behaviours in comparison with more traditional, non-graph based code, such as the one used in \roofit. In \zfit{}, most of these complications are hidden from the user and the library offers a user experience similar to the one found in other model fitting libraries. This requires some extra care behind the scenes, as explained in Appendix \ref{appendix:tensorflow}.
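To make the graph-based approach and the analytic gradients mentioned above more concrete, the following minimal sketch, assuming the TensorFlow~1 static-graph API (\texttt{tf.gradients}, \texttt{tf.Session}), builds a toy loss with two illustrative parameters; the framework derives the gradient through the chain rule, and applying it a second time yields a second-order derivative. The loss and the parameter names are purely illustrative and do not correspond to \zfit{} code.
\begin{verbatim}
import tensorflow as tf

# Two free parameters of a toy loss; in a real fit these would be
# the model parameters (names and values are illustrative only).
a = tf.Variable(1.5, dtype=tf.float64)
b = tf.Variable(-0.5, dtype=tf.float64)

# Building the loss only adds nodes to the computation graph;
# nothing is evaluated yet.
loss = tf.square(a - 2.0) + tf.square(a * b - 1.0)

# The framework applies the chain rule through the graph to obtain
# an analytic gradient; applying it again gives a second-order
# derivative.
grad_a = tf.gradients(loss, a)[0]
second_a = tf.gradients(grad_a, a)[0]

with tf.Session() as sess:  # evaluation only happens here
    sess.run(tf.global_variables_initializer())
    print(sess.run([loss, grad_a, second_a]))
\end{verbatim}
In TensorFlow~2, the same derivatives can be obtained with \texttt{tf.GradientTape}, while the static-graph calls above remain available under \texttt{tf.compat.v1}.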