\section{Performance}
\label{sec:performance}

In the previous Sections, the structure and logic of model fitting in \zfit{} 
were elaborated. Apart from the functionality, model fitting involves a lot of 
number crunching, and thus state-of-the-art performance is a hard requirement.

The performance of code depends on several factors and can often be divided 
into a serial and a parallel part. The efficient implementation of the 
parallelised parts of the code is as important as deciding \textit{when} to 
actually run in parallel, as discussed in Appendix 
\ref{appendix:profiling tensorflow}. While this part is mostly left to TF, the 
fact that some pieces of code simply \textit{are not} parallelisable, and 
therefore cause a bottleneck, depends on the nature of the problem itself. In 
real code, the two are not clearly separated and a mixture of both influences 
the actual performance.

In this Section, we focus on quantifying the performance of \zfit{} by 
measuring the execution time of the whole process, thereby mixing the 
performance of TF and \zfit{} with the bottlenecks coming inherently from 
model fitting. While this limits the ability to draw conclusions about 
individual bottlenecks and remove them, it reflects the performance of the 
library as experienced in real-life usage.

In the following, two studies are presented: on one hand, a more artificial 
but well scalable example consisting of a sum of Gaussian models; on the other 
hand, a more realistic case, namely a fit to an angular distribution. In both 
cases the performance of toy studies is measured and compared. The hardware 
specifications can be found in Appendix \ref{appendix:hardware specs}.

\subsection{Gaussian models}
\label{sec:perf gaussian models}

There are three quantities of interest to describe the scaling of the 
library: the complexity of the model, the number of free parameters and the 
size of the data sample used in the fit. In this study, a sum of Gaussian PDFs 
is used as the model. The fractions are constant, while the means and widths 
are parameters, each shifted by a different constant; they either all depend 
on only two free parameters or their number scales with the number of 
Gaussians.
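
To make this setup concrete, a minimal sketch of such a model in \zfit{} is 
shown below for the variant in which every Gaussian has its own mean and 
width; the shift constants, the limits and the helper name 
\pyth{build_sum_model} are chosen purely for illustration and do not reproduce 
the exact scripts used for the measurements. The variant with only two shared 
parameters can be built analogously using composed parameters.

\begin{center}
\begin{minipage}{\textwidth}
\begin{python}
import zfit

def build_sum_model(n_gauss, obs):
    # One mean and width per Gaussian, each shifted by a different constant.
    pdfs = []
    for i in range(n_gauss):
        mu = zfit.Parameter(f"mu_{i}", 0.3 * i, -5, 5)
        sigma = zfit.Parameter(f"sigma_{i}", 1 + 0.1 * i, 0.1, 10)
        pdfs.append(zfit.pdf.Gauss(mu=mu, sigma=sigma, obs=obs))
    # Constant (non-floating) fractions; the last one is implicit.
    fracs = [1. / n_gauss] * (n_gauss - 1)
    return zfit.pdf.SumPDF(pdfs=pdfs, fracs=fracs)

obs = zfit.Space('x', limits=(-10, 10))
model = build_sum_model(n_gauss=9, obs=obs)
\end{python}
\end{minipage}
\end{center}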

This model is used to perform toy studies. First, a sample from the model is 
created using a fixed, initial parameter setup. Then the parameters are 
randomised and a fit to the sampled data is performed. The implementation of 
sampling and the minimisation are two distinct steps, and for fits to data, 
only the latter is actually invoked. Since measuring both together combines 
two execution times, making any reasoning harder, and since there is a known, 
temporary\footnote{This problem is resolved by now, though in this thesis the 
old measurements are still used since the conclusions are the same.} 
bottleneck in \zfit{} sampling, only the execution time of the minimisation is 
measured. An approximate comparison of the current performance with sampling 
can be found in Appendix \ref{appendix:additional profiling}.
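
A single toy then roughly follows the sketch below, where only the call to 
\pyth{minimize} is timed. The randomisation range and the exact parameter 
handling are simplified for illustration and do not correspond exactly to the 
scripts used for the measurements.

\begin{center}
\begin{minipage}{\textwidth}
\begin{python}
import time
import numpy as np
import zfit

# Sample once from the model with the fixed, initial parameter values.
data = model.sample(n=100000)

# Randomise the free parameters before the fit (illustrative range).
for param in model.get_params(floating=True):
    param.set_value(np.random.uniform(0.5, 1.5))

nll = zfit.loss.UnbinnedNLL(model=model, data=data)
minimizer = zfit.minimize.Minuit()

start = time.time()  # only the minimisation itself is timed
result = minimizer.minimize(nll)
print(f"Toy took {time.time() - start:.2f} s, converged: {result.converged}")
\end{python}
\end{minipage}
\end{center}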

For each setup, twenty toys are run, ranging from 2 up to 20 Gaussians with 
sample sizes from 128 to 8 million events (except on the GPU, where the sample 
size only goes up to 4 million\footnote{The GPU used has a comparably small 
memory. Performing larger-than-memory computations and multi-GPU support are 
still work in progress.}). A comparison with an implementation in \roofit is 
performed, although only ranging from 2 to 9 Gaussians.\footnote{This is due 
to technical problems when using a sum of more than 9 PDFs with \roofit, which 
were overcome only after the measurements were done.} To have a fair 
comparison, both use the Minuit algorithm for minimisation.

In the following, four cases are considered:

\begin{itemize}
	\item An implementation in \zfit{}, labelled as ``zfit CPU'', using the 
	analytic gradient provided by TF. In this case, an initial run is 
	performed to remove the graph compile time from the measurement. While 
	the effect is not significant, this provides a more realistic estimate.
	
	\item The same implementation as above is used but the 
	analytic gradients provided by TF are disabled, denoted by the addition 
	``nograd''. Instead, the Minuit minimiser 
	calculates a numerical approximation of the gradients internally.
	
	\item The same as ``zfit CPU'' but run on a GPU and labelled as ``zfit 
	GPU''.
	
	\item An implementation in \roofit using the Python bindings with 
	PyROOT. The parallelisation is done when invoking the \pyth{fitTo} method 
	and equals the number of cores available; a minimal sketch is shown after 
	this list.
\end{itemize}
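
For reference, the \roofit configuration corresponds roughly to the following 
PyROOT call, where \pyth{model} and \pyth{data} stand for the \roofit PDF and 
dataset of a toy (their construction is not shown) and the core count of 8 
matches the hardware used.

\begin{center}
\begin{minipage}{\textwidth}
\begin{python}
import ROOT

# model is a RooAbsPdf, data a RooDataSet built for the toy (not shown).
# NumCPU parallelises the likelihood evaluation over the available cores.
fit_result = model.fitTo(data,
                         ROOT.RooFit.NumCPU(8),
                         ROOT.RooFit.Save(True))
fit_result.Print()
\end{python}
\end{minipage}
\end{center}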

\begin{figure}[tbp]

	\begin{subfigure}{.5\textwidth}
		\includegraphics[width=\textwidth]{figs/gperf/loglog_9gauss_across_2param.png}
		\caption{2 free parameters}
		\label{fig:gperf_left}
	\end{subfigure}%
	\begin{subfigure}{.5\textwidth}
		\includegraphics[width=\textwidth]{figs/gperf/loglog_9gauss_across_9param.png}
		\caption{18 free parameters}
	\end{subfigure}%
	\caption{A sum of 9 Gaussian PDFs with shared (left) or individual (right) 
	$\mu$ and $\sigma$. On the y-axis, the time for a single fit, averaged 
	over 20 fits, is shown. It is plotted against the number of events drawn 
	per toy.}
	\label{fig:gperf}
\end{figure}

In Fig. \ref{fig:gperf}, the time per toy is measured and plotted against the 
number of events used in the toy. While \zfit{} on CPU outperforms \roofit in 
the left plot with only two free parameters, in the right plot with 18 free 
parameters, \roofit performs significantly faster for a low number of events. 
This comes from the fact that, while the Minuit minimiser is used in both 
measurements, it is tuned differently: the version used by \roofit is 
configured to better cope with a bumpy likelihood, as given by a high number 
of parameters and a low number of events, which results in a vastly superior 
performance for complicated, low-statistics cases while reducing the 
performance in simpler cases. It should be noted that tuning the minimiser is 
still an ongoing effort in \zfit{} and the performance behaviour is likely to 
change in the future. Different tunings cannot simply be quantified and 
compared, since not only the number of evaluations of the loss function, but 
also the gradient and the (very expensive) Hessian computations are part of 
the strategy. This limits the conclusions that can be drawn from the 
comparison of the performance for a low number of events.
Furthermore, the minimiser for 18 free parameters using \zfit{} was not able 
to converge for more than a million events \textit{without the automatic 
gradient from TF}; therefore no data points are available 
there.\footnote{This indicates numerical instabilities due to limited 
precision; there are ways to circumvent them, which are planned to be 
implemented in \zfit{}.}

The comparison in Fig. \ref{fig:gperf} still reveals a few important points 
that agree very well with the expectations.
\begin{itemize}
	\item As the number of events increases, the execution time of \roofit 
	increases monotonically.
	\item There is no advantage in using parallelisation for very few 
	calculations, since the overhead of splitting the work and collecting the 
	results is dominant. With an increasing number of events this overhead 
	becomes negligible and, as a consequence, the execution time of \zfit{} 
	increases far more slowly than that of \roofit. This also comes from the 
	fact that more events mean a more stable loss shape, so the minimiser 
	used in \zfit{} performs better.
	\item The GPU is highly efficient in computing thousands of events in 
	parallel. For only a few data points, though, the overhead of moving data 
	back and forth dominates strongly, making it unfeasible for a small 
	number of events. The continuous decrease of the time up to 10'000 events 
	in the left plot of Fig. \ref{fig:gperf} confirms this and even shows a 
	drop in the computation time.
\end{itemize}

As a conclusion, for very small examples of around 100 events, \zfit{} is 
\textit{marginally} slower than the corresponding \roofit example. For larger 
fits, the speedup of \zfit{} reaches up to a factor of ten at around a million 
events. For more complicated fits and a small number of events, \roofit is an 
order of magnitude faster because of its currently better tuned minimiser, 
though there is ongoing work on this in \zfit{}. In larger examples, the GPU 
delivers a performance similar to the multicore setup. Finally, not using the 
TF gradient yields only a minor penalty for large fits and can even be faster 
for small ones; however, experiments have shown that the increase in time can 
be larger for complicated, real use cases, where the analytic gradient helps 
to reduce the number of steps the minimiser has to take.

\begin{figure}[tbp]

	\begin{subfigure}{.5\textwidth}
		\includegraphics[width=\textwidth]{figs/gperf/time_vs_nevents_2param_roofit.png}
		\caption{\roofit}
	\end{subfigure}%
	\begin{subfigure}{.5\textwidth}
		\includegraphics[width=\textwidth]{figs/gperf/time_vs_nevents_2param_zfit.png}
		\caption{\zfit{} CPU}
	\end{subfigure}%
	\caption{Measurement of the computation time with a sum of 
	$n_{\mathrm{gauss}}$ Gaussians and in total two free parameters. Notice 
	that the y-scale is the same for both plots, but the x-axis for \zfit{} 
	extends an order of magnitude further. Also, \zfit{} sums up to 16 
	Gaussians, whereas \roofit only goes up to 9.}
	\label{fig:time events}
\end{figure}

In Fig. \ref{fig:time events}, a comparison of what can be achieved within a 
certain time frame is shown. The fitting time scales with the number of events 
for different numbers of added Gaussians, having only two free parameters 
\textit{in total}. The observation matches the intuitive behaviour of scaling 
with the complexity of the model and the size of the sample. Compared to 
\roofit, the fitting time of \zfit{} increases by an order of magnitude less; 
note that both time scales end at 100 seconds. Within this time, \zfit{} fits 
8 million events with 16 Gaussians, while \roofit fits half a million with 9 
Gaussians. For this setup, 8 CPUs on a shared cluster were used, leaving some 
ambiguity about the results due to the unknown configuration and actual 
workload on the machine. However, the order of magnitude matches the results 
in Fig. \ref{fig:gperf}, which were obtained in a clean environment.

\begin{figure}[tbp]
	\centering
	\includegraphics[width=0.5\textwidth]{figs/gperf/time_vs_nevents_nparam_zfit.png}
	\caption{Time of a single toy as a function of the number of events used, 
	plotted for a sum of $n_{\mathrm{gauss}}$ Gaussians with two free 
	parameters \textit{per Gaussian}.}
	\label{fig:time events nparam}
\end{figure}

Scaling the number of free parameters with the number of Gaussians added is 
shown in Fig. \ref{fig:time events nparam}. It is clearly visible that for a 
large number of parameters, a higher number of events can actually be more 
performant on an absolute scale, at least up to a certain threshold, since it 
reduces the number of steps needed for the minimisation.

\subsection{Angular analysis}
\label{sec:angular analysis}

In order to study the scalability of \zfit{} in a challenging, realistic 
setting, a toy study from an ongoing analysis involving \zfit{} is used. The 
example, which consists of the angular analysis of 
$\decay{\Bz}{\Kstar(\decay{}{\Kp\pim})\lepton^+\lepton^-}$ with $\lepton$ 
being either $\electron$ or $\muon$, is an important legacy analysis, and the 
fact that it was implementable in \zfit{} is in itself already an achievement. 
The model comes from the folded angular analysis: the folding used to measure 
the $P_5'$ parameter, as defined in \cite{Aaij:2015oid}, is implemented. The 
resulting model is given by
\begin{align}
	\label{eq:kstangular}
	f_{angular}(\theta_{\kaon}, \theta_{\lepton}, \phi; F_L, A_T^{(2)}, P_5') 
	=~&\frac{3}{4} 
	(1 - 
	F_L) \sin^2\theta_{\kaon} + 
	F_L\cos^2\theta_{\kaon} 
	\nonumber \\
		  &+ \frac{1}{4}(1 - F_L)\sin^2\theta_{\kaon}\cos2\theta_{\lepton} 
		  \nonumber \\
		  &- F_L\cos^2\theta_{\kaon} \cos 2\theta_{\lepton} \\
		  &+ S_3 \sin^2\theta_{\kaon}\sin^2\theta_{\lepton}\cos 2\phi \nonumber 
		  \\
		  &+ S_5 \sin 2\theta_{\kaon}\sin\theta_{\lepton}\cos \phi \nonumber
\end{align}

with 
\begin{align*}
	S_3 &= \frac{1}{2} A_T^{(2)}(1 - F_L) \\
	S_5 &= P_5'\sqrt{F_L(1 - F_L)}.
\end{align*}

\begin{figure}[tbp]
	\centering
	\includegraphics[width=0.85\textwidth]{figs/kst_perf/SketchDecay.png}
	\caption{Schematic view of the angles of the 
	$\decay{\Bz}{\Kstar(\decay{}{\Kp\pim})\lepton^+\lepton^-}$ decay. The 
	image is taken from \cite{Khachatryan:2038750}.}
	\label{fig:kst_angles}
\end{figure}

The model depends on three angles, shown in 
Fig. \ref{fig:kst_angles}, where

\begin{minipage}{\textwidth}
\begin{itemize}
	\item $\phi$ is the angle between the plane spanned by the flight
	directions of the two leptons and the plane spanned by the kaon and pion,
	\item $\theta_{\kaon}$ is the angle between the kaon and the negative 
	flight direction of the \Bz and
	\item $\theta_{\lepton}$ is the angle between the $\lepton^+$ 
	($\lepton^-$) and the negative flight direction of the \Bz.
\end{itemize}
\end{minipage}
\newline

The model is extended to four dimensions by adding a description of the \B 
invariant mass distribution, built as the product with the angular part 
defined in Eq. \ref{eq:kstangular}. 
The model used for the mass shape is a combination of two Gaussians, each 
with a power-law tail. 

The implementation of the angular part is straightforward and follows the 
example in Sec. \ref{sec:quickstart}. Using the implemented \pyth{DoubleCB} 
for the mass, the four-dimensional model can be built as

\begin{center}
\begin{minipage}{\textwidth}
\begin{python}
limit_thetak = zfit.Space('thetak', limits=...)
limit_thetal = zfit.Space('thetal', limits=...)
limit_phi = zfit.Space('phi', limits=...)
limit_bmass = zfit.Space('bmass', limits=...)

angular_obs = limit_thetak * limit_thetal * limit_phi
angular = AngularPDF(obs=angular_obs, ...)
mass = zfit.pdf.DoubleCB(obs=limit_bmass, ...)

model = angular * mass
\end{python}
\end{minipage}
\end{center}
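
For illustration, a minimal sketch of how the \pyth{AngularPDF} used above 
could implement Eq. \ref{eq:kstangular} is given below. It uses the 
\pyth{ZPDF} base class of \zfit{} and leaves the normalisation to the library; 
the parameter names and the exact structure are chosen freely here and are not 
necessarily those of the actual analysis code.

\begin{center}
\begin{minipage}{\textwidth}
\begin{python}
import tensorflow as tf
import zfit

class AngularPDF(zfit.pdf.ZPDF):
    """Unnormalised P5' folded angular distribution."""
    _PARAMS = ['FL', 'AT2', 'P5p']
    _N_OBS = 3  # thetak, thetal, phi

    def _unnormalized_pdf(self, x):
        thetak, thetal, phi = x.unstack_x()
        FL = self.params['FL']
        AT2 = self.params['AT2']
        P5p = self.params['P5p']
        S3 = 0.5 * AT2 * (1 - FL)
        S5 = P5p * tf.sqrt(FL * (1 - FL))

        sin2_thetak = tf.sin(thetak) ** 2
        cos2_thetak = tf.cos(thetak) ** 2
        cos_2thetal = tf.cos(2 * thetal)
        return (0.75 * (1 - FL) * sin2_thetak
                + FL * cos2_thetak
                + 0.25 * (1 - FL) * sin2_thetak * cos_2thetal
                - FL * cos2_thetak * cos_2thetal
                + S3 * sin2_thetak * tf.sin(thetal)**2 * tf.cos(2 * phi)
                + S5 * tf.sin(2 * thetak) * tf.sin(thetal) * tf.cos(phi))
\end{python}
\end{minipage}
\end{center}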


With this model, a set of toy studies is performed, varying the number of 
events in each of them while keeping the number of free parameters fixed to 
nine.

\begin{figure}[tbp]
	\begin{subfigure}{.5\textwidth}
		\includegraphics[width=\textwidth]{figs/kst_perf/loglog_plot.png}
		\caption{Double log scale}
	\end{subfigure}%
	\begin{subfigure}{.5\textwidth}
		\includegraphics[width=\textwidth]{figs/kst_perf/plot_normal.png}
		\caption{Linear scale}
	\end{subfigure}%
	\caption{The figures show the time needed per toy for the four-dimensional 
	$P_5'$ folded angular distribution. 25 toys are produced for each number 
	of events.}
\label{fig:kst_angular_performance}
\end{figure}

Fig. \ref{fig:kst_angular_performance} shows the performance of the toys on a 
shared cluster with 8 CPUs requested. The time per toy increases slightly 
sublinearly with the number of events. An interesting number is the average 
time for toys with around 1000 events, the number of events expected to be 
seen in the data and therefore actually used in the real toy studies, which is 
around one second per toy.
Both observations agree well with the expectations for a model fitting 
library. The whole example demonstrates the suitability of \zfit{} for use in 
non-trivial, real-world analyses.