\section{Performance studies}

\subsection{Hardware specification}
\label{appendix:hardware specs}

The measurements are performed either on CPUs only or additionally on a GPU. The following hardware and software stack has been used for Fig. \ref{fig:gperf}.
\begin{description}
\item[CPU] Intel i7-8850H at 2.60 GHz with 12 cores, 6 of which are virtual cores provided by hyper-threading. The available shared memory is 32 GB of RAM, of which never more than half was used. While hyper-threading can be very useful for applications where the bottleneck is not the actual computation on the CPU, for HPC workloads this is often not the case. As the experiments have shown, there is only a minor difference between using 6 or 12 cores. Therefore, only the 6 physical cores are used in order to quantify the speedup correctly.
\item[GPU] Mobile Nvidia P1000 with 4 GB of RAM. It contains the same processing unit as the consumer GTX 1050 series but is intended for professional usage and performs float64 computations more efficiently.
\end{description}
Notably, the GPU and the CPU are in roughly the same price range, which allows for a certain degree of comparison between them.

For Fig. \ref{fig:time events}, a cluster server with varying hardware was used. Eight cores were requested for the studies, though the workload of other jobs and the CPU type may have an impact on the results.

The tests were performed with TensorFlow version 1.13, pre-built by Anaconda and using the MKL library. A build based on the Eigen library is also available; tests revealed different performance for different tasks, differing by around a factor of two in time. For the GPU version, CUDA 10.0 with cuDNN was used.

\subsection{Profiling TensorFlow}
\label{appendix:profiling tensorflow}

Code consists of parallelised and serial parts. While the speed of the former scales\footnote{Ideally. In reality, cores share resources such as caches and memory bandwidth, so the scaling is less than perfectly linear.} with the number of cores, the latter does not. The total execution time $t$ is given by
\begin{equation}
t = \sum_{i=1}^{n_s} t_s^{(i)} + \sum_{i=1}^{n_p} \left(\frac{t_p^{(i)}}{n_{cpu}} + t_o^{(i)}\right),
\end{equation}
where $n_s$ and $n_p$ are the numbers of \textit{serial} and \textit{parallel} parts, respectively. $t_s^{(i)}$ refers to the execution time of the $i$th serial part, $t_p^{(i)}$ to the execution time of the $i$th parallel part \textit{if executed serially}, and $t_o^{(i)}$ denotes the overhead incurred by each parallel execution.

The serial parts consist of
\begin{itemize}
\item reading in data from disk,
\item setup code such as building a model in \zfit{},
\item global operations such as reductions over all values; for example, determining whether a stopping criterion such as the sum of all gradients has fallen below a threshold is a serial operation,
\end{itemize}
while the parallel parts usually contain the heavy computations: evaluating a function on data, whereby the data can be split amongst the cores. The overhead of a parallel execution includes
\begin{itemize}
\item the overhead of creating a new thread for the parallel execution,
\item the time to move data between the CPUs or even to the GPU.
\end{itemize}
In addition, the serial time contains any bottlenecks in I/O or in moving data.

In order to achieve maximum performance and minimise $t$,
\begin{itemize}
\item there should be as little serial execution time as possible; this is, however, heavily constrained by the program logic, and a certain amount will always remain,
\item the code should be split into serial and parallel parts as few times as possible, since each split adds a constant $t_o$ term.
\end{itemize}
These two points often conflict heavily and result in a simple criterion for when to parallelise a piece of code:
\begin{equation*}
t_s^{(i)} - t_o^{(i)} > t_p^{(i)} / n_{cpu},
\end{equation*}
where $t_s^{(i)}$ here denotes the time the piece would take if executed serially. This reveals that even for a large $n_{cpu}$, the overhead can be the decisive term. Furthermore, whether it is advantageous to execute a piece of code serially or in parallel depends on $n_{cpu}$. Together with the difficulty of predicting the overhead, this leaves even the decision of whether a perfectly parallelisable piece of code should actually be run in parallel as a heuristic problem. The strategy TF uses to find the optimal parallelisation is to run a small simulation of the graph, thereby determining the overhead and taking the number of cores into account.

Since TF actually executes the computations, any execution time measurement largely reflects the performance of TF on this task. As TF itself is under active development, its performance is in general expected to improve in the future. To get a reasonable estimate of what TF is capable of and to somewhat avoid potential bottlenecks from \zfit{}, a dummy test function similar to a loss was written in pure TF. The function creates three times one million random numbers and performs a few operations on them before summing and reducing them to a single number. In a loop, this is repeated 100 times, each result being added to the previous one. There are no I/O bottlenecks and, while not an optimal example for TF, it is reasonably close to what can be expected in model fitting.
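A minimal sketch of such a loss-like benchmark is shown below. It illustrates the setup rather than reproducing the exact code used for the measurements; in particular, the choice of element-wise operations, the redrawing of the random numbers in each repetition, the thread pinning and the warm-up run are assumptions made for this example.
\begin{verbatim}
import time

import tensorflow as tf  # TensorFlow 1.x graph-mode API


def loss_like(n=1):
    """Build a graph mimicking a loss: random data, a few ops, a reduction."""
    total = tf.constant(0.0, dtype=tf.float64)
    for _ in range(100 * n):  # "n x problem": the loss is added n times more
        # Three times one million random numbers, float64 as in model fitting;
        # the random op is stateful, so every repetition does real work.
        x = tf.random.uniform(shape=(3, 1000000), dtype=tf.float64)
        y = tf.exp(-x) * tf.sin(x) + x ** 2  # a few element-wise operations
        total += tf.reduce_sum(y)  # reduce to a single number
    return total


# Restrict TF to the 6 physical cores.
config = tf.ConfigProto(intra_op_parallelism_threads=6,
                        inter_op_parallelism_threads=6)

with tf.Session(config=config) as sess:
    loss = loss_like(n=1)
    sess.run(loss)  # warm-up run; excludes graph construction effects
    start = time.time()
    sess.run(loss)
    print("execution time: {:.3f} s".format(time.time() - start))
\end{verbatim}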
\begin{table}[tbp]
\begin{center}
\begin{tabularx}{0.7\textwidth}{ X | X | X | X }
 & 1 CPU & 6 CPUs & GPU \\
\hline
$1\times$ problem & 1.0 s & 0.27 s & 0.093 s \\
\hline
$12\times$ problem & 13.4 s & 3.3 s & 1.0 s \\
\hline
\end{tabularx}
\end{center}
\caption{Execution time measurements of the loss-like function. The complexity of the problem is scaled by adding the same loss $n$ times to the reduce function.}
\label{tab:loss timing}
\end{table}

As Tab. \ref{tab:loss timing} shows, the speedup per core is roughly a factor of 2/3 compared to the ideal case of 1: for example, going from 1 CPU to 6 CPUs, the execution time would ideally be expected to decrease by a factor of 1/6, but it decreases by about 1/4 instead. While there are ways of building more efficient code, the example was deliberately chosen to reflect an arbitrary, non-optimised implementation as it is expected to be found in \zfit{}, mostly with custom models.

\subsection{Additional profiling}
\label{appendix:additional profiling}

Additional performance studies, not shown in Sec. \ref{sec:performance}, are displayed here.

\begin{figure}[tbp]
\includegraphics[width=\textwidth]{figs/gperf/loglog_9gauss_across_sampling_2param.png}
\caption{Full toy study with a sum of 9 Gaussians and 2 free parameters. \zfit{}'s temporary bottleneck in sampling causes an extraordinary increase in execution time, mostly for low numbers of events, but the conclusions and the overall scaling behaviour remain the same as described in Sec. \ref{sec:perf gaussian models}.}
\label{fig:gperf across sampling}
\end{figure}
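For orientation, the following sketch shows how a model of this kind, a sum of nine Gaussians with two floating parameters, can be built and used in a single toy step with \zfit{}. The shared parameters, their ranges and the equal fixed fractions are illustrative assumptions, not the exact configuration of the study.
\begin{verbatim}
import zfit

obs = zfit.Space("x", limits=(-10, 10))

# The two floating parameters, shared by all components
# (assumed for illustration; the study configuration may differ).
mu = zfit.Parameter("mu", 0.0, -5.0, 5.0)
sigma = zfit.Parameter("sigma", 1.0, 0.1, 10.0)

# Nine Gaussian components with fixed, equal fractions
# (eight fractions are given; the ninth is implicit).
gaussians = [zfit.pdf.Gauss(mu=mu, sigma=sigma, obs=obs) for _ in range(9)]
model = zfit.pdf.SumPDF(gaussians, fracs=[1.0 / 9.0] * 8)

# One toy: sample from the model, then fit the sample back.
sampler = model.create_sampler(n=100000)
sampler.resample()
nll = zfit.loss.UnbinnedNLL(model=model, data=sampler)
result = zfit.minimize.Minuit().minimize(nll)
\end{verbatim}
Timing such a toy, sampling plus fitting, as a function of the number of events is what Fig. \ref{fig:gperf across sampling} displays.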