\section*{Appendices} \appendix \section{Likelihood} \label{appendix:likelihood}

A reasonable loss is a quantity that expresses the \textit{probability of the model given the data}, which can then be maximised. Bayes' theorem, applied to a model parametrised by $\theta$ and data $x$, expresses this as
\begin{equation} \label{eq:bayes theorem}
P(\theta | x ) = \frac{P(x | \theta) P(\theta)}{P(x)}.
\end{equation}
It states that \textit{the probability of $\theta$ given our observed data equals the likelihood of finding our data given $\theta$ times the prior of $\theta$, divided by the prior of our data}. While $P(x | \theta)$ is a probability, fixing the data and varying $\theta$ yields the likelihood of $\theta$, denoted as $\mathcal{L}(\theta)$. If there is no previous knowledge about $\theta$, a uniform prior can be assumed; this is usually the case. The same is done for the data, whose marginal probability acts as a normalising constant. Both therefore only introduce constants into our term, since they do not depend on our parametrisation $\theta$.

Note that the following derivations use a discrete probability distribution as opposed to a continuous one. This simplifies the argumentation; the results are equivalently valid for the continuous case, so no strict distinction is made. We start out from the symmetry of the joint probability: the probability of two events occurring together does not depend on their order
\begin{equation} \label{eq:combined_prob}
P(A \cap B) = P(B \cap A).
\end{equation}
The probability of two events both happening can be rewritten in terms of the conditional probability $P(A | B)$, read as \textit{the probability of A given B}, and the marginal probability $P(B)$ as
\begin{equation} \label{eq:combined_prob_rewritten}
P(A \cap B) = P(A | B) P(B).
\end{equation}
Combining \ref{eq:combined_prob} with \ref{eq:combined_prob_rewritten} and rearranging, we get
\begin{equation}
P(A | B) = \frac{P(B | A) P(A)}{P(B)},
\end{equation}
which is the form famously known as Bayes' theorem. It states that \textit{the probability of A given B equals the likelihood of B given A times the prior of A, divided by the prior of B}. While $P(B | A)$ is a probability, fixing B and varying A yields the likelihood of A, denoted as $\mathcal{L}(A)$. If there is no previous knowledge about A, we can assume a uniform prior. The same is done for B, whose marginal probability usually acts as a normalising constant. Both therefore only introduce constants into our term and are not of interest later on; they will be omitted from now on, leaving
\begin{equation}
P(A | B) \propto P(B | A) = \mathcal{L}(A).
\end{equation}
Given a hypothesis $H_0$ and a dataset $x$, the likelihood is the probability that an event happened under a certain hypothesis
\begin{equation}
\mathcal{L}(H_0) = P(x | H_0).
\end{equation}
While the probability itself is normalised over $x$, a likelihood is not. Notice that the likelihood of $H_0$ is a function of $H_0$, namely the probability of $x$ under $H_0$ \textit{with $H_0$ changing}. So the difference between likelihood and probability lies in which quantity is held fixed and which one varies, as the following sketch illustrates.
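To make the distinction concrete, the following minimal sketch evaluates the likelihood of the location parameter $\mu$ of a Gaussian for a fixed dataset: the data are held fixed while $\mu$ varies, and the resulting values are not normalised over $\mu$ (a hypothetical, self-contained example using \texttt{numpy} and \texttt{scipy}, independent of \zfit{})
\begin{center}
\begin{minipage}{\textwidth}
\begin{python}
import numpy as np
from scipy.stats import norm

data = np.array([4.8, 5.1, 5.3, 4.9, 5.0])  # fixed observations

def likelihood(mu, sigma=1.0):
    # product of the densities of the independent data points
    return np.prod(norm.pdf(data, loc=mu, scale=sigma))

# the likelihood is a function of the hypothesis, not of the data
for mu in (4.0, 5.0, 6.0):
    print(mu, likelihood(mu))
\end{python}
\end{minipage}
\end{center}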
We assume now that our hypothesis is described by a model parametrised with $\theta$, a PDF as defined in \ref{eq:pdf}. When speaking of the parametrisation, there is a distinction between parameters of interest (POI) and nuisance parameters. The latter describe parts of the model's shape but are not of direct interest in the sense that they do not appear in the hypothesis but rather describe well-known effects such as the width of some smearing. Since they also have to be inferred in the same way, we restrict ourselves in the derivation of the likelihood to models with exclusively POIs. The distinction becomes apparent later in hypothesis testing, which goes beyond the scope of this thesis.

While the likelihood quantifies the plausibility of the model parametrised with $\theta$ under the observations $x$, it is, as seen above, equal to the probability density of finding $x$ given the parametrisation $\theta$
\begin{equation}
\mathcal{L}(\theta | x) = f_{\theta} (x).
\end{equation}
Since the likelihood under the data $x$ is the product of the likelihoods under each independent data point, we can write the likelihood as a product of probability densities as in \ref{eq:likelihood_from_products}. Using this expression, we can obtain a maximum likelihood estimate of our parametrisation $\theta$
\begin{equation}
\hat{\theta} = \operatorname{argmax}_{\theta} (\mathcal{L}(\theta | x)).
\end{equation}
In practice, finding the maximum is done using numerical methods. Eq. \ref{eq:likelihood_from_products} involves the product of many small numbers $<1$, which is numerically problematic given the limited floating point precision of computers. The monotonicity of the logarithm can be used to transform the probability densities into log densities, whereby the multiplication becomes an addition
\begin{equation}
\operatorname{argmax}_{\theta}(\mathcal{L}(\theta|x)) = \operatorname{argmax}_{\theta}(\ln(\mathcal{L}(\theta|x))).
\end{equation}
It is often more convenient to find a minimum instead of a maximum, so that our likelihood takes the form of \ref{eq:nll}, the negative log likelihood (NLL). Therefore, our likelihood estimation becomes
\begin{align} \label{eq:loglikelihood_from_sums}
\hat{\theta} = \operatorname{argmin}_{\theta} \left(- \sum_{i}^{n} \ln(f_{\theta}(x_i))\right).
\end{align}
This estimation is valid for $n$ measured data points, each with weight one. We can generalise this expression to include event weights: a weight enters as a power of the density, which the logarithm turns into a simple multiplication
\begin{align} \label{eq:loglikelihood_from_sums_weighted}
\hat{\theta} = \operatorname{argmin}_{\theta} \left(- \sum_{i}^{n} w_i \cdot \ln(f_{\theta}(x_i))\right)
\end{align}
with $w_i$ being the weight of the $i^{th}$ event. This generalisation is not only useful when using data points directly to build an unbinned NLL, but also allows binning the data first and using the bin heights as event weights by choosing an appropriate $x_i$ for each bin. Binning can speed up the computation significantly but costs some precision\footnote{ Currently, only unbinned losses are implemented in \zfit{} but the extension to binned losses is already tested and will be added in the future.}.

So far we considered the likelihood of a single model fitted to a dataset. The same parameters often occur in more than one model, for example the invariant mass of a mother particle. To take this into account, a simultaneous fit can be performed, which is simply the multiplication of several likelihoods
\begin{align} \label{eq:simultaneous_likelihood}
\mathcal{L}_{f(x)}(\theta | \{data_0, data_1, ..., data_n\}) &= \prod_{i} \mathcal{L}(\theta_i, data_i)
\end{align}
where $\theta_i$ is the subset of the parameters $\theta$ used by the model fitted to $data_i$; a minimal numerical sketch of such a fit follows below.
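The following sketch minimises the sum of two NLLs as in \ref{eq:loglikelihood_from_sums_weighted} and \ref{eq:simultaneous_likelihood}, with two Gaussian models sharing the location parameter $\mu$ (again a hypothetical, self-contained example with \texttt{numpy} and \texttt{scipy}; \zfit{} builds the equivalent loss from its models and data)
\begin{center}
\begin{minipage}{\textwidth}
\begin{python}
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

# two independent datasets with per-event weights, sharing mu
data1, w1 = np.random.normal(5.0, 1.0, 1000), np.ones(1000)
data2, w2 = np.random.normal(5.0, 2.0, 500), np.ones(500)

def nll(params):
    mu, sigma1, sigma2 = params
    # weighted NLL per dataset; the product of the likelihoods
    # becomes a sum of negative log likelihoods
    nll1 = -np.sum(w1 * norm.logpdf(data1, loc=mu, scale=sigma1))
    nll2 = -np.sum(w2 * norm.logpdf(data2, loc=mu, scale=sigma2))
    return nll1 + nll2

result = minimize(nll, x0=[4.5, 1.5, 1.5])  # numerical minimisation
print(result.x)  # estimates of mu, sigma1, sigma2
\end{python}
\end{minipage}
\end{center}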
The likelihood can be multiplied by additional terms, most notably an extended maximum likelihood (EML) term and parameter constraints. If a parameter changes both the shape and the overall normalisation of the pdf, an EML fit is superior. This situation typically arises when several shapes are summed, as when adding the background and the signal shape. Taking for example the model $N_{sig} \cdot PDF_{sig} + N_{bkg} \cdot PDF_{bkg}$ and assuming a Poisson distribution of the number of events in the data, we can multiply the likelihood by a term
\begin{align} \label{eq:extended_likelihood}
\mathcal{L}_{extended} &= \operatorname{Pois}(N_{data}; N_{tot}) \\
&= \frac{N_{tot}^{N_{data}} e^{- N_{tot}}}{N_{data}!}
\end{align}
where $N_{tot} = N_{sig} + N_{bkg}$ is the expected and $N_{data}$ the observed number of events. Therefore the EML estimation looks like
\begin{align} \label{eq:loglikelihood_from_sums_weighted_extended}
\hat{\theta} = \operatorname{argmin} \left(- \sum_{i} \ln(f_i(\theta|data_i)) - \sum_{i} \ln(\operatorname{Pois}(N_{data_i}; N_i))\right)
\end{align}
where $data_i$ are, as before, different datasets. The weights are contained in $data_i$ and taken into account as described above; the total number of events is, in general, the sum of the weights of all events.

For certain parameters, prior knowledge is available. If the uncertainty of this knowledge is of the same order of magnitude as the expected fit sensitivity, a constraint term can be used to incorporate it. This is simply an additional term that the likelihood is multiplied by
\begin{align} \label{eq:constraints}
\mathcal{L}(\theta) &= \mathcal{L}_{unconstrained} \prod_{i} f_{constr_i}(\theta) \\
&= \mathcal{L}_{unconstrained} \mathcal{L}_{constr}.
\end{align}
As an example, the constraint for a parameter $\theta_i$ that is Gaussian-constrained with mean $\mu_{constr}$ and width $\sigma_{constr}$ reads
\begin{align}
f_{constr_i}(\theta) = Gauss(\theta_i; \mu_{constr}, \sigma_{constr}).
\end{align}
Combining equations \ref{eq:simultaneous_likelihood}, \ref{eq:extended_likelihood} and \ref{eq:constraints} yields the general form of the likelihood
\begin{equation} \label{eq:total_likelihood}
\mathcal{L} = \mathcal{L}_{f(x)} \cdot \mathcal{L}_{extended} \cdot \mathcal{L}_{constr}.
\end{equation}
The absolute magnitude of the likelihood itself does not have any specific meaning, but ratios and scans over the parameter space of $\theta$ allow for statistical interpretation. This, however, goes beyond the scope of \zfit{} and is intentionally left to other packages like lauztat\cite{software:lauztat}. As mentioned in the beginning, a likelihood as described in \ref{eq:total_likelihood} is not the only possibility to define a loss, though it is the most commonly used one. A prominent example is the $\chi^2$ loss, which is equivalent to a likelihood in the limiting case where each point is normally distributed. A schematic construction of the combined NLL of \ref{eq:total_likelihood} is sketched below.
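The following sketch shows schematically how the pieces of \ref{eq:total_likelihood} combine into a single NLL for a signal-plus-background model (a hypothetical example with \texttt{numpy} and \texttt{scipy}; names, shapes and constraint values are illustrative and not the \zfit{} API)
\begin{center}
\begin{minipage}{\textwidth}
\begin{python}
import numpy as np
from scipy.optimize import minimize
from scipy.stats import expon, norm, poisson

# toy dataset: Gaussian signal plus exponential background
data = np.concatenate([np.random.normal(5.0, 0.5, 800),
                       np.random.exponential(5.0, 200)])

def total_nll(params):
    mu, sigma, tau, n_sig, n_bkg = params
    n_tot = n_sig + n_bkg
    # shape part: yield-weighted mixture of normalised pdfs
    pdf = (n_sig * norm.pdf(data, mu, sigma)
           + n_bkg * expon.pdf(data, scale=tau)) / n_tot
    nll = -np.sum(np.log(pdf))
    # extended term: Poisson probability of the observed count
    nll -= poisson.logpmf(len(data), n_tot)
    # Gaussian constraint on sigma from prior knowledge
    nll -= norm.logpdf(sigma, loc=0.5, scale=0.05)
    return nll

fit = minimize(total_nll, x0=[5.0, 0.5, 5.0, 700.0, 300.0])
\end{python}
\end{minipage}
\end{center}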
\section{Backend}
Choosing the right computational backend for model fitting is crucial. In the following sections, the different paradigms and backend designs are introduced. Furthermore, the TF library and the way it is wrapped inside \zfit{} are explained.

\subsection{HPC and paradigms} \label{appendix:paradigms}
High performance computing (HPC) solves an entirely different problem than high-level programming languages do. While the flexibility offered by Python is nearly unlimited and an incredibly strong and convenient feature, it is only possible due to the dynamic nature of the language. The interpreter, which actually translates and executes the code on the machine level, can make nearly no assumptions about any object: neither about its size or type nor about its future use in the code. This means that no optimisation ahead of the computation is possible. For HPC this is the crucial key to success: the more that is known \textit{in advance} about the actual computation, the better the performance will be. The less flexibility is available, the more assumptions can be made. This way the layout of arrays, the parallelisation of computations, caching and more can be achieved efficiently. Since less flexibility is not a desirable characteristic overall, a good compromise has to be found. The distinction between two fundamental paradigms is of importance
\begin{description}
\item[Imperative] Most code is run imperatively: a statement is executed when the line is hit. Python, C++ and many more languages work that way. The advantage is that the state changes as the code runs. An addition of two numbers, for example, is executed right when the command appears. The disadvantage is that it is impossible to optimise over more than one step, since the next line of code is not known yet.
\item[Declarative/Graph] When a statement is supposed to be executed, it is \textit{actually} not yet run but recorded. It runs either explicitly when asked for or implicitly if it is needed as a dependency. The latter could also be described as ``lazy evaluation". The disadvantage is that the state of the code may look undetermined, since for example an addition of two numbers is not yet executed when the line is hit. Instead, an object representing the computation is returned. The paradigm can also be described as a ``graph" based approach, since an execution graph is built. In the special case of HPC, this can be a computational graph describing data inputs, operations and outputs. While a declarative paradigm is better in terms of raw performance and optimisations in general, it is also more limited in functionality and cannot offer the flexibility that the imperative paradigm offers. The workflow is divided into \textit{compilation} or \textit{graph construction} time and \textit{run} time.
\end{description}
In reality, a mixture of these approaches is usually applied on different levels, with some languages and frameworks tending strongly toward one or the other. For example, any ahead-of-time compilation as used by a C/C++ compiler is essentially a declarative step, while the C/C++ language itself is imperative. Deep learning frameworks show a strong tendency towards the declarative paradigm; more specifically, a graph is built. To ease debugging and make experimentation simpler, the immediate execution of a graph can make a graph-based approach behave imperatively, and some frameworks therefore offer such a mixed mode. For maximal performance, though, a graph-based approach remains superior for reasonably complex problems. There are three major advantages of using a graph.
\begin{description}
\item[Distributed computing] Moore's law has stated for over 40 years that the processing power of CPUs, which originally corresponded to the clock speed, doubles approximately every 1.5 years. While there is no theoretical foundation for this statement, reality has held up to it. Since the beginning of the 2000s, CPUs have reached clock speeds around the gigahertz scale and started hitting a limit: the thermal dissipation of a CPU grows with the third power of the clock speed. As a consequence, most CPUs today do not run at higher clock rates in order to stay efficient. Nonetheless, CPU power keeps increasing at a similar rate by using more than one CPU unit in parallel; multicore systems have become common on almost every machine. It is therefore imperative that any numerically intensive library makes good use of distributed computing.
This ranges from shared-memory machines to large-scale clusters with multiple nodes. Moreover, with the recent success of ever-growing deep neural networks (DNNs) and frameworks built to make easy use of them, vectorised computing devices like GPUs can easily be used as an alternative to CPUs.
\item[Mathematical optimisation] Since not only a single step but the whole computation is available at once, possibilities for mathematical optimisation arise. Most notably, an analytic expression for the derivative is available through automatic differentiation, which successively applies the chain rule to every operation; a short sketch is shown in Section~\ref{appendix:tensorflow}.
\item[Caching] A thorough analysis of the computation graph allows for various computational optimisations. Common subexpression elimination identifies whether the same operation is executed in several places and substitutes the nodes with a single computation.
\end{description}
The disadvantages of building a graph first are the overhead to build, analyse and store the graph, and the additional indirection, since the main principle is to build the computation graph once and run it several times. This implies non-intuitive behaviour for users coming from an imperative background. While the graph-based approach is very capable, there are two fundamentally different assumptions that can be made about graphs.
\begin{description}
\item[Static] A static graph is built once and \textit{cannot} be altered. If we have, for example, a complicated function taking a uniform random number generator as input, we build the graph once and execute it several times. To change the input to a normally distributed random generator, the whole graph has to be built again with one operation, the random input generator, exchanged.
\item[Dynamic] Dynamic graphs are mutable and can be used if there are many actual modifications to the graph. Any part can be changed; it is not restricted to append-only operations.
\end{description}
As an example, TensorFlow offers a static graph while PyTorch uses a dynamic graph.

\subsection{Working with TensorFlow} \label{appendix:tensorflow}
Graph-based approaches imply another level of indirection and need some wrapping in order to avoid unexpected or inefficient behaviour. In \zfit{} this is mostly hidden, exposing some of it but removing the obstacles for most use cases. As an example, we will look at the task of generating random numbers several times in a row. In the imperative style this looks as follows, where numpy is used to generate a uniform sample
\begin{center}
\begin{minipage}{\textwidth}
\begin{python}
for _ in range(n_samples):
    sample = np.random.uniform(...)
    # do something with the sample
\end{python}
\end{minipage}
\end{center}
In a declarative approach as with TF, the same is achieved by \textit{first} building the actual operation of the graph and \textit{then} executing the computation
\begin{center}
\begin{minipage}{\textwidth}
\begin{python}
sample_op = tf.random.uniform(...)
for _ in range(n_samples):
    sample = zfit.run(sample_op)
    # do something with the sample
\end{python}
\end{minipage}
\end{center}
where \pyth{zfit.run(...)} is just the command to execute a certain part of the graph. The object \pyth{sample} is in both cases a numpy array. Note that in the declarative approach, we created the operation only once but \textit{execute} it multiple times.
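The same build-once, run-many pattern enables the automatic differentiation mentioned in Section~\ref{appendix:paradigms}: since the whole computation is known as a graph, the derivative can be constructed as just another graph operation. A minimal sketch in plain TF 1.x style, using \pyth{tf.Session} directly instead of the \pyth{zfit.run} wrapper
\begin{center}
\begin{minipage}{\textwidth}
\begin{python}
import tensorflow as tf

x = tf.Variable(3.0)
y = x ** 2  # build the graph for y = x^2
grad_op = tf.gradients(y, [x])[0]  # graph node computing dy/dx

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    print(sess.run(grad_op))  # 6.0, evaluated only at run time
\end{python}
\end{minipage}
\end{center}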
This indirection can be surprising for the unaware, and a typical mistake is to implement it this way
\begin{center}
\begin{minipage}{\textwidth}
\begin{python}
# BAD EXAMPLE
for _ in range(n_samples):
    sample_op = tf.random.uniform(...)
    sample = zfit.run(sample_op)
    # do something with the sample
\end{python}
\end{minipage}
\end{center}
which just fills up the graph with additional operations that all do the same thing. In order to hide this inconvenience from the user, a caching system is in place in \zfit{} to prevent an accidental re-creation of operations. In the above example, we can think of wrapping a function \pyth{uniform} (\textit{fictive example}) that creates the operation on the first call and returns the cached operation afterwards. In pseudo code, this looks like the following
\begin{center}
\begin{minipage}{\textwidth}
\begin{python}
cached_op = None

def uniform(...):
    global cached_op  # make "cached_op" assignable
    if cached_op is None:
        cached_op = tf.random.uniform(...)  # create op only once
    return cached_op
\end{python}
\end{minipage}
\end{center}
This function can then be used in a mixed style
\begin{center}
\begin{minipage}{\textwidth}
\begin{python}
for _ in range(n_samples):
    sample_op = uniform(...)  # as created above
    sample = zfit.run(sample_op)
    # do something with the sample
\end{python}
\end{minipage}
\end{center}
This hides some of the difficulties. While this example of plain random number generation may seem artificial, a prominent example is the loss, which is typically built from models and data. Calling \pyth{value} builds the operation, adds it to the graph and stores it, so that in subsequent calls the stored operation is returned instead of a new one being created. While this is very convenient for the user, combining the best of both worlds, it comes with the additional burden for \zfit{} of having to cache the operations efficiently.

\subsubsection{Caching} \label{appendix:caching}
In HPC, certain parts of a computation sometimes do not need to be recomputed, and storing the results in a cache for later reuse can improve the performance significantly. Since \zfit{} uses a declarative, graph-based approach, there are two different kinds of caches: one for the operations built at construction time, as discussed above, which appear only due to the graph-based approach; and one for intermediate calculations in general, which corresponds to objects built at run time. The latter is also commonly used in imperative approaches.

As seen before, creating operations in a declarative approach adds each of them to a global graph, a collection of operations. Since these are only instructions and not actual computations, there is no need to rebuild the operations and stitch them together again; the previously built instructions can be reused. Therefore \zfit{} provides the possibility to cache Tensors after they have been built for the first time, inside the object the Tensor was created in. Since the graph is static and cannot change, even a small modification to a part of it requires a complete rebuild\footnote{There are ways of changing the \textit{value} but not the logic, the computation flow.}. This can be as little as removing a single, additional term in the model. Any object that may need to perform such a modification is therefore registered with the caching object and can notify the cacher in order to invalidate its cache and rebuild any Tensor; the pattern is sketched below.
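Schematically, such a register-and-notify mechanism could look like the following (a simplified, hypothetical sketch; class and method names are illustrative and not the actual \zfit{} implementation)
\begin{center}
\begin{minipage}{\textwidth}
\begin{python}
class Cacher:
    """Holds built operations; registered objects can invalidate it."""

    def __init__(self):
        self._cached_ops = {}

    def get_or_build(self, key, build_func):
        # return the stored operation, building it only on the first call
        if key not in self._cached_ops:
            self._cached_ops[key] = build_func()
        return self._cached_ops[key]

    def invalidate(self):
        # called by a registered object whose structure changed;
        # forces a rebuild of the Tensors on the next access
        self._cached_ops.clear()


class Model:
    def __init__(self, cacher):
        self.cacher = cacher  # the model registers itself with the cacher

    def remove_term(self, term):
        # a structural change to the model: the old graph is invalid
        self.cacher.invalidate()
\end{python}
\end{minipage}
\end{center}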
Numerical results of computations inside the graph can remain the same over several executions of an operation and can therefore be cached. As opposed to the construction-time caching, numerical caches are also invalidated if a number inside the graph changes, not only if its structure does. Caching in a declarative approach is not straightforward, since it implies replacing a node, which can be a large subgraph by itself, with a single value. As the graph cannot be changed, this would require a complete rebuild of the graph, rendering the caching far less efficient. There are two ways to circumvent this problem
\begin{description}
\item[feed\_dict] The \pyth{run} method of the TF session offers the possibility to overwrite specific nodes with a value using a feed\_dict. This means that cached values can simply be stored and overridden at run time. At the current stage, \zfit{} is not designed to always control the execution but leaves the freedom to the user to invoke the \pyth{Session.run} method themselves.
\item[Variable] Since all TF objects so far are just operations and not numbers, storing a value between runs requires a special object. A TF \pyth{Variable} can be used for this case. Its purpose is to store a value but also to act as a node in the graph with a read operation that returns the current value. Its value can be changed between runs in case it has to be updated. This, though, can have side effects on other objects that still use this cache. The disadvantage here is that the initialisation of the Variable can be tricky if it is created inside a control flow such as \pyth{tf.while} or \pyth{tf.cond}.
\end{description}
In \zfit{}, a feed\_dict independent implementation is currently being tested, but this is likely to change in the future. One main problem with this kind of caching is the gradient. If a value is not supposed to change, the case is simple, since there is no gradient. But if there is a value that \textit{can} change, the gradient's value also needs to be cached. A likely solution is to wrap the ordinary Tensor class from TF and provide a caching Tensor that automatically takes care of gradient caching as well. The feed\_dict also seems like an efficient solution, at least for cases without a gradient, as illustrated below.
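As an illustration of the feed\_dict approach, consider the following minimal sketch (TF 1.x; the tensor names are hypothetical). Any feedable tensor of the graph can be overridden at run time, so the value of an expensive subgraph can be computed once and fed back in later runs
\begin{center}
\begin{minipage}{\textwidth}
\begin{python}
import tensorflow as tf

expensive = tf.reduce_sum(tf.random.uniform(shape=(1000,)))  # costly subgraph
result = expensive * 2.0 + 1.0

with tf.Session() as sess:
    cached_value = sess.run(expensive)  # compute once, keep the number
    # later runs: feed the cached value instead of recomputing the subgraph
    print(sess.run(result, feed_dict={expensive: cached_value}))
\end{python}
\end{minipage}
\end{center}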