\section{Performance} \label{sec:performance}

In the previous Sections, the structure and logic of model fitting in \zfit{} were elaborated. Apart from the functionality, model fitting involves a lot of number crunching and thus state-of-the-art performance is a hard requirement. The performance of code depends on several factors and can often be divided into a serial and a parallel part. The efficient implementation of the parallelised parts of the code is as important as deciding \textit{when} to actually run in parallel, as discussed in Appendix \ref{appendix:profiling tensorflow}. While this part is mostly left to TF, the fact that some pieces of code simply \textit{are not} parallelisable, and therefore cause a bottleneck, depends on the nature of the problem itself. In real code, the two are not clearly separated and a mixture of both influences the actual performance. In this Section we will focus on quantifying the performance of \zfit{} by measuring the execution time of the whole process, mixing together the performance of TF and \zfit{} with the bottlenecks coming inherently from model fitting. While this limits the ability to draw conclusions about bottlenecks and remove them, it reflects the performance of the library as experienced in real-life usage. In the following, two studies are presented: on one hand, a more artificial but well scalable example consisting of Gaussian models; on the other hand, a more realistic case, namely a fit to an angular distribution. In both cases the performance of toy studies is measured and compared. The hardware specifications can be found in Appendix \ref{appendix:hardware specs}.

\subsection{Gaussian models} \label{sec:perf gaussian models}

There are three quantities of interest to describe the scaling of the library: the complexity of the model, the number of free parameters and the size of the data sample used in the fit. In this study, a sum of Gaussian PDFs is used as the model. The fractions are constant, while the means and widths are parameters, each shifted by a different constant; they either all depend on only two free parameters or their number scales with the number of Gaussians. This model is used to perform toy studies. First, a sample from the model is created using a fixed, initial parameter setup. Then the parameters are randomised and a fit to the sampled data is performed. Sampling and minimisation are implemented as two distinct steps, and for fits to data only the latter is actually invoked. Since measuring both would combine two execution times, making it harder to reason about the results, and since there is a known, temporary\footnote{This problem is resolved now, though in this thesis the old measurements are still used since the conclusions are the same.} bottleneck in \zfit{} sampling, only the execution time of the minimisation is measured. An approximate comparison of the current performance with sampling can be found in Appendix \ref{appendix:additional profiling}. For each setup, twenty toys are run, ranging from 2 up to 20 Gaussians with sample sizes from 128 to 8 million events (except for the GPU, where it only goes up to 4 million\footnote{The GPU used has a comparably small memory. Performing larger-than-memory computations and multi-GPU support is still work in progress.}).
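To make the setup more concrete, a minimal sketch of a single such toy is shown below for the case in which the number of parameters scales with the number of Gaussians. The concrete values, limits and the simple randomisation of the starting values are purely illustrative and do not correspond to the exact configuration used for the measurements; only the last line, the minimisation, is what is timed in the study.
\begin{center}
\begin{minipage}{\textwidth}
\begin{python}
import numpy as np
import zfit

obs = zfit.Space('x', limits=(-10, 10))

# one Gaussian per component, each with its own mean and width shifted by a
# different constant (the case where the number of parameters scales with n_gauss)
n_gauss = 9
params, pdfs = [], []
for i in range(n_gauss):
    mu = zfit.Parameter(f'mu_{i}', 0.5 + 0.3 * i, -10, 10)
    sigma = zfit.Parameter(f'sigma_{i}', 1.0 + 0.1 * i, 0.1, 10)
    pdfs.append(zfit.pdf.Gauss(mu=mu, sigma=sigma, obs=obs))
    params.extend([mu, sigma])

# constant (non-floating) fractions; the last fraction is implicit
model = zfit.pdf.SumPDF(pdfs=pdfs, fracs=[1 / n_gauss] * (n_gauss - 1))

minimizer = zfit.minimize.Minuit()

# a single toy: sample with the initial values, randomise the starting point,
# then fit back; only the minimisation step is timed
data = model.sample(n=100_000)
for param in params:
    param.set_value(float(param.value()) * np.random.uniform(0.9, 1.1))
nll = zfit.loss.UnbinnedNLL(model=model, data=data)
result = minimizer.minimize(nll)
\end{python}
\end{minipage}
\end{center}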
A comparison with an implementation in \roofit is performed, although only ranging from 2 to 9 Gaussians.\footnote{This is due to technical problems when using a sum of more than 9 PDFs with \roofit, which were overcome only after the measurements were done.} To have a fair comparison, both use the Minuit algorithm for minimisation. In the following, four cases are considered:
\begin{itemize}
\item An implementation in \zfit{}, labelled as ``zfit CPU'', using the analytic gradient provided by TF. In this case, an initial run is done to remove the graph compile time. While not significant, this provides a more realistic estimate.
\item The same implementation as above, but with the analytic gradients provided by TF disabled, denoted by the addition ``nograd''. Instead, the Minuit minimiser calculates a numerical approximation of the gradients internally.
\item The same as ``zfit CPU'', but run on a GPU and labelled as ``zfit GPU''.
\item An implementation in \roofit using the Python bindings with PyROOT. The parallelisation is enabled when invoking the \pyth{fitTo} method and the number of parallel workers equals the number of available cores.
\end{itemize}
\begin{figure}[tbp]
\begin{subfigure}{.5\textwidth} \includegraphics[width=\textwidth]{figs/gperf/loglog_9gauss_across_2param.png} \caption{2 free parameters} \label{fig:gperf_left} \end{subfigure}%
\begin{subfigure}{.5\textwidth} \includegraphics[width=\textwidth]{figs/gperf/loglog_9gauss_across_9param.png} \caption{18 free parameters} \end{subfigure}%
\caption{A sum of 9 Gaussian PDFs with shared (left) or individual (right) $\mu$ and $\sigma$. On the y-axis, the time for a single fit is shown, averaged over 20 fits. It is plotted against the number of events that have been drawn per toy.}
\label{fig:gperf}
\end{figure}
In Fig. \ref{fig:gperf}, the time per toy is measured and plotted against the number of events used in the toy. While \zfit{} on CPU outperforms \roofit in the left plot with only two free parameters, in the right plot with 18 free parameters \roofit performs significantly faster for a low number of events. This comes from the fact that, while the Minuit minimiser is used in both measurements, they are tuned differently: the version used by \roofit is configured to better cope with a bumpy likelihood, as occurs with a high number of parameters and a low number of events, which results in a vastly superior performance for complicated, low-statistics cases while reducing the performance in simpler cases. It should be noted that tuning the minimiser is still an ongoing effort in \zfit{} and the performance behaviour is likely to change in the future. Different tunings cannot simply be quantified and compared, since not only the number of evaluations of the loss function, but also the gradient and the (very expensive) Hessian computations are part of the strategy. This limits the conclusions that can be drawn from the comparison of the performance with a low number of events. Furthermore, the minimiser for 18 free parameters using \zfit{} was not able to converge for more than a million events \textit{without the automatic gradient from TF}, therefore no data points are available there.\footnote{This indicates numerical instabilities due to limited precision; there are ways to circumvent them which are planned to be implemented in \zfit{}.} The comparison in Fig. \ref{fig:gperf} still reveals a few important things that agree very well with the expectations.
\begin{itemize}
\item As the number of events increases, the execution time of \roofit increases monotonically.
\item There is no advantage in using parallelisation for very few calculations, since the overhead of splitting the work and collecting the results dominates. With an increasing number of events this overhead becomes negligible, and as a consequence the execution time of \zfit{} increases much more slowly than that of \roofit. This also comes from the fact that more events lead to a more stable loss shape, so the minimiser used in \zfit{} performs better.
\item The GPU is highly efficient in computing thousands of events in parallel. For only a few data points though, the overhead of moving data back and forth dominates strongly, making it inefficient for small samples. The continuous decrease of the time up to 10'000 events in the left plot of Fig. \ref{fig:gperf} confirms this and even shows a drop in the computation time.
\end{itemize}
In conclusion, for very small examples of around 100 events, \zfit{} is \textit{marginally} slower than the corresponding \roofit example. For larger fits, the speedup of \zfit{} reaches up to a factor of ten at around a million events. For more complicated fits with a small number of events, \roofit is an order of magnitude faster because of its currently better tuned minimiser, though there is ongoing work in \zfit{}. In larger examples, the GPU delivers a performance similar to the multicore setup. Finally, not using the TF gradient yields only a minor penalty for large fits and can even be faster for small ones; experiments have shown, however, that the penalty can be larger for complicated, real use cases, where the analytic gradient helps to reduce the number of steps the minimiser has to take.
\begin{figure}[tbp]
\begin{subfigure}{.5\textwidth} \includegraphics[width=\textwidth]{figs/gperf/time_vs_nevents_2param_roofit.png} \caption{\roofit} \end{subfigure}%
\begin{subfigure}{.5\textwidth} \includegraphics[width=\textwidth]{figs/gperf/time_vs_nevents_2param_zfit.png} \caption{\zfit{} CPU} \end{subfigure}%
\caption{Measurement of the computation time for a sum of $n_{gauss}$ Gaussians with in total two free parameters. Notice that the y-scale is the same for both plots but the x-axis for \zfit{} goes an order of magnitude higher. Also, \zfit{} sums up to 16 Gaussians whereas \roofit only goes up to 9.}
\label{fig:time events}
\end{figure}
In Fig. \ref{fig:time events}, a comparison of what can be achieved within a certain time frame is shown. The fitting time scales with the number of events for different numbers of added Gaussians, with only two free parameters \textit{in total}. The observation matches the intuitive behaviour of scaling with the complexity and the size of the sample. Compared to \roofit, the fitting time of \zfit{} increases an order of magnitude less; note that both time scales end at 100 seconds. In this time, \zfit{} fits 8 million events with 16 Gaussians, while \roofit does half a million with 9 Gaussians. For this setup, 8 CPUs on a shared cluster were used, leaving a slight ambiguity about the results due to the unknown configuration and actual workload on the machine. However, the order of magnitude matches the results in Fig. \ref{fig:gperf}, which were obtained in a clean environment.
\begin{figure}[tbp]
\centering
\includegraphics[width=0.5\textwidth]{figs/gperf/time_vs_nevents_nparam_zfit.png}
\caption{Time of a single toy as a function of the number of events used.
Plotted for a sum of $n_{gauss}$ Gaussians with two free parameters \textit{per Gaussian}.}
\label{fig:time events nparam}
\end{figure}
The scaling when the number of free parameters grows with the number of Gaussians added is shown in Fig. \ref{fig:time events nparam}. We clearly see that, for a large number of parameters, a higher number of events can actually be more performant on an absolute scale, at least up to a certain threshold, since it reduces the number of steps to be taken in the minimisation.

\subsection{Angular analysis} \label{sec:angular analysis}

In order to study the scalability of \zfit{} in a challenging, realistic setting, a toy study from an ongoing analysis involving \zfit{} is used. The example, which consists of the angular analysis of $\decay{\Bz}{\Kstar(\decay{}{\Kp\pim})\lepton^+\lepton^-}$ with $\lepton$ being either $\electron$ or $\muon$, is an important legacy analysis, and the fact that it was implementable in \zfit{} is in itself already an achievement. The model comes from the folded angular analysis; the folding to measure the $P_5'$ parameter as defined in \cite{Aaij:2015oid} is implemented. The resulting model is given by
\begin{align} \label{eq:kstangular}
f_{angular}(\theta_{\kaon}, \theta_{\lepton}, \phi; F_L, A_T^{(2)}, P_5') =~&\frac{3}{4} (1 - F_L) \sin^2\theta_{\kaon} + F_L\cos^2\theta_{\kaon} \nonumber \\
&+ \frac{1}{4}(1 - F_L)\sin^2\theta_{\kaon}\cos2\theta_{\lepton} \nonumber \\
&- F_L\cos^2\theta_{\kaon} \cos 2\theta_{\lepton} \\
&+ S_3 \sin^2\theta_{\kaon}\sin^2\theta_{\lepton}\cos 2\phi \nonumber \\
&+ S_5 \sin 2\theta_{\kaon}\sin\theta_{\lepton}\cos \phi \nonumber
\end{align}
with
\begin{align*}
S_3 &= \frac{1}{2} A_T^{(2)}(1 - F_L) \\
S_5 &= P_5'\sqrt{F_L(1 - F_L)}.
\end{align*}
\begin{figure}[tbp]
\centering
\includegraphics[width=0.85\textwidth]{figs/kst_perf/SketchDecay.png}
\caption{Schematic view of the angles of the $\decay{\Bz}{\Kstar(\decay{}{\Kp\pim})\lepton^+\lepton^-}$ decay. The image was taken from \cite{Khachatryan:2038750}.}
\label{fig:kst_angles}
\end{figure}
The model depends on three angles, shown in Fig. \ref{fig:kst_angles}, where
\begin{minipage}{\textwidth}
\begin{itemize}
\item $\phi$ is the angle between the plane spanned by the flight directions of the two leptons and the plane spanned by the kaon and the pion,
\item $\theta_{\kaon}$ is the angle between the kaon and the negative flight direction of the \Bz and
\item $\theta_{\lepton}$ is the angle between the $\lepton^+$ ($\lepton^-$) and the negative flight direction of the \Bz.
\end{itemize}
\end{minipage}
\newline
The model is extended to four dimensions by adding a description of the \B invariant mass distribution, taking the product with the angular part as defined in Eq. \ref{eq:kstangular}. The model used for the mass shape is a combination of two Gaussians, each with a power-law tail. The implementation of the angular part is straightforward and follows the example in Sec. \ref{sec:quickstart}; a sketch of a possible implementation is shown at the end of this Section. Using the implemented \pyth{DoubleCB} for the mass, the four-dimensional model can be built as
\begin{center}
\begin{minipage}{\textwidth}
\begin{python}
limit_thetak = zfit.Space('thetak', limits=...)
limit_thetal = zfit.Space('thetal', limits=...)
limit_phi = zfit.Space('phi', limits=...)
limit_bmass = zfit.Space('bmass', limits=...)

angular_obs = limit_thetak * limit_thetal * limit_phi
angular = AngularPDF(obs=angular_obs, ...)
mass = zfit.pdf.DoubleCB(obs=limit_bmass, ...)
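# multiplying PDFs that depend on different observables yields the product PDF
# over the combined four-dimensional space (thetak, thetal, phi, bmass)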
model = angular * mass
\end{python}
\end{minipage}
\end{center}
With this model, a set of toy studies is performed, varying the number of events in each of them while keeping the number of free parameters fixed to nine.
\begin{figure}[tbp]
\begin{subfigure}{.5\textwidth} \includegraphics[width=\textwidth]{figs/kst_perf/loglog_plot.png} \caption{Double logarithmic scale} \end{subfigure}%
\begin{subfigure}{.5\textwidth} \includegraphics[width=\textwidth]{figs/kst_perf/plot_normal.png} \caption{Linear scale} \end{subfigure}%
\caption{The figures show the time needed per toy for the four-dimensional $P_5'$ folded angular distribution. 25 toys are produced for each number of events.}
\label{fig:kst_angular_performance}
\end{figure}
Fig. \ref{fig:kst_angular_performance} shows the performance of the toys on a shared cluster with 8 CPUs requested; the time per toy increases slightly sublinearly with the number of events. An interesting number is the average time for toys with around 1000 events, the number of events expected in data and therefore actually used in the real toy studies: it amounts to around one second per toy. Both observations agree well with the expectations for a model fitting library. The whole example demonstrates the suitability of \zfit{} to be used in non-trivial, real-world analyses.
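As anticipated above, a possible implementation of the angular part is sketched below for reference. It is a minimal sketch following the custom PDF example of Sec. \ref{sec:quickstart}: the class and parameter names (\pyth{FL}, \pyth{AT2}, \pyth{P5p}) are illustrative, the unnormalised shape implements Eq. \ref{eq:kstangular} directly with TF operations, and the normalisation over the three angles is left to \zfit{}; the actual code used in the analysis may differ in its details.
\begin{center}
\begin{minipage}{\textwidth}
\begin{python}
import tensorflow as tf
import zfit


class AngularPDF(zfit.pdf.ZPDF):
    """Unnormalised P5' folded angular distribution."""

    _PARAMS = ['FL', 'AT2', 'P5p']  # F_L, A_T^(2), P_5'
    _N_OBS = 3                      # thetak, thetal, phi

    def _unnormalized_pdf(self, x):
        thetak, thetal, phi = x.unstack_x()
        FL = self.params['FL']
        AT2 = self.params['AT2']
        P5p = self.params['P5p']

        sin2k = tf.math.sin(thetak) ** 2
        cos2k = tf.math.cos(thetak) ** 2
        S3 = 0.5 * AT2 * (1 - FL)
        S5 = P5p * tf.math.sqrt(FL * (1 - FL))

        return (3 / 4 * (1 - FL) * sin2k
                + FL * cos2k
                + 1 / 4 * (1 - FL) * sin2k * tf.math.cos(2 * thetal)
                - FL * cos2k * tf.math.cos(2 * thetal)
                + S3 * sin2k * tf.math.sin(thetal) ** 2 * tf.math.cos(2 * phi)
                + S5 * tf.math.sin(2 * thetak) * tf.math.sin(thetal) * tf.math.cos(phi))
\end{python}
\end{minipage}
\end{center}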