\section*{Appendices} \appendix \section{Likelihood} \label{appendix:likelihood}

A reasonable loss is a quantity that expresses the \textit{probability of the model given the data}, which can then be maximised. Bayes' theorem, applied to a model parametrised by $\theta$ and data $x$, expresses this as
\begin{equation} \label{eq:bayes theorem}
P(\theta | x ) = \frac{P(x | \theta) P(\theta)}{P(x)}.
\end{equation}
It states that \textit{the probability of $\theta$ given our observed data equals the likelihood of finding our data given $\theta$ times the prior of $\theta$, divided by the prior of our data}. While $P(x | \theta)$ is a probability, fixing the data and varying $\theta$ yields the likelihood of $\theta$, denoted as $\mathcal{L}(\theta)$. If there is no previous knowledge about $\theta$, a uniform prior can be assumed; this is usually the case. The same is done for the data, whose marginal probability acts as a normalising constant. Both therefore only introduce constants into our term, since they do not depend on our parametrisation $\theta$.

Note that the following derivations use a discrete probability distribution as opposed to a continuous one. This simplifies the argumentation; the results are equivalently valid for the continuous case, so no strict distinction is made. We start out from the symmetry of the joint probability: the probability of two events occurring together does not depend on their order
\begin{equation} \label{eq:combined_prob}
P(A \cap B) = P(B \cap A).
\end{equation}
The probability of two events both happening can be rewritten in terms of the conditional probability $P(A | B)$, read as \textit{the probability of A given B}, and the marginal probability $P(B)$ as
\begin{equation} \label{eq:combined_prob_rewritten}
P(A \cap B) = P(A | B) P(B).
\end{equation}
Combining \ref{eq:combined_prob} with \ref{eq:combined_prob_rewritten} and rearranging, we get
\begin{equation}
P(A | B) = \frac{P(B | A) P(A)}{P(B)},
\end{equation}
which is the form famously known as Bayes' theorem. It states that \textit{the probability of A given B equals the likelihood of B given A times the prior of A, divided by the prior of B}. While $P(B | A)$ is a probability, fixing B and varying A yields the likelihood of A, denoted as $\mathcal{L}(A)$. If there is no previous knowledge about A, we can assume a uniform prior. The same is done for B, whose marginal probability usually acts as a normalising constant. Both therefore only introduce constants into our term and are not of interest later on; they will be omitted from now on, leaving
\begin{equation}
P(A | B) \propto P(B | A) = \mathcal{L}(A).
\end{equation}
Given a hypothesis $H_0$ and a dataset $x$, the likelihood is the probability that an event happened under a certain hypothesis
\begin{equation}
\mathcal{L}(H_0) = P(x | H_0).
\end{equation}
While the probability itself is normalised over $x$, a likelihood is not. Notice that the likelihood of $H_0$ is a function of $H_0$, namely the probability of $x$ under $H_0$ \textit{with $H_0$ changing}. So the difference between likelihood and probability lies in which quantity is held fixed and which one varies, as the following sketch illustrates.
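To make the distinction concrete, the following minimal sketch evaluates the likelihood of the location parameter $\mu$ of a Gaussian for a fixed dataset: the data are held fixed while $\mu$ varies, and the resulting values are not normalised over $\mu$ (a hypothetical, self-contained example using \texttt{numpy} and \texttt{scipy}, independent of \zfit{})
\begin{center}
\begin{minipage}{\textwidth}
\begin{python}
import numpy as np
from scipy.stats import norm

data = np.array([4.8, 5.1, 5.3, 4.9, 5.0])  # fixed observations

def likelihood(mu, sigma=1.0):
    # product of the densities of the independent data points
    return np.prod(norm.pdf(data, loc=mu, scale=sigma))

# the likelihood is a function of the hypothesis, not of the data
for mu in (4.0, 5.0, 6.0):
    print(mu, likelihood(mu))
\end{python}
\end{minipage}
\end{center}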
We assume now that our hypothesis is described by a model parametrised with $\theta$, a PDF as defined in \ref{eq:pdf}. When speaking of the parametrisation, there is a distinction between parameters of interest (POI) and nuisance parameters. The latter describe parts of the model's shape but are not of direct interest in the sense that they do not appear in the hypothesis but rather describe well-known effects such as the width of some smearing. Since they also have to be inferred in the same way, we restrict ourselves in the derivation of the likelihood to models with exclusively POIs. The distinction becomes apparent later in hypothesis testing, which goes beyond the scope of this thesis.

While the likelihood quantifies the plausibility of the model parametrised with $\theta$ under the observations $x$, it is, as seen above, equal to the probability density of finding $x$ given the parametrisation $\theta$
\begin{equation}
\mathcal{L}(\theta | x) = f_{\theta} (x).
\end{equation}
Since the likelihood under the data $x$ is the product of the likelihoods under each independent data point, we can write the likelihood as a product of probability densities as in \ref{eq:likelihood_from_products}. Using this expression, we can obtain a maximum likelihood estimate of our parametrisation $\theta$
\begin{equation}
\hat{\theta} = \operatorname{argmax}_{\theta} (\mathcal{L}(\theta | x)).
\end{equation}
In practice, finding the maximum is done using numerical methods. Eq. \ref{eq:likelihood_from_products} involves the product of many small numbers $<1$, which is numerically problematic given the limited floating point precision of computers. The monotonicity of the logarithm can be used to transform the probability densities into log densities, whereby the multiplication becomes an addition
\begin{equation}
\operatorname{argmax}_{\theta}(\mathcal{L}(\theta|x)) = \operatorname{argmax}_{\theta}(\ln(\mathcal{L}(\theta|x))).
\end{equation}
It is often more convenient to find a minimum instead of a maximum, so that our likelihood takes the form of \ref{eq:nll}, the negative log likelihood (NLL). Therefore, our likelihood estimation becomes
\begin{align} \label{eq:loglikelihood_from_sums}
\hat{\theta} = \operatorname{argmin}_{\theta} \left(- \sum_{i}^{n} \ln(f_{\theta}(x_i))\right).
\end{align}
This estimation is valid for $n$ measured data points, each with weight one. We can generalise this expression to include event weights: a weight enters as a power of the density, which the logarithm turns into a simple multiplication
\begin{align} \label{eq:loglikelihood_from_sums_weighted}
\hat{\theta} = \operatorname{argmin}_{\theta} \left(- \sum_{i}^{n} w_i \cdot \ln(f_{\theta}(x_i))\right)
\end{align}
with $w_i$ being the weight of the $i^{th}$ event. This generalisation is not only useful when using data points directly to build an unbinned NLL, but also allows binning the data first and using the bin heights as event weights by choosing an appropriate $x_i$ for each bin. Binning can speed up the computation significantly but costs some precision\footnote{ Currently, only unbinned losses are implemented in \zfit{} but the extension to binned losses is already tested and will be added in the future.}.

So far we considered the likelihood of a single model fitted to a dataset. The same parameters often occur in more than one model, for example the invariant mass of a mother particle. To take this into account, a simultaneous fit can be performed, which is simply the multiplication of several likelihoods
\begin{align} \label{eq:simultaneous_likelihood}
\mathcal{L}_{f(x)}(\theta | \{data_0, data_1, ..., data_n\}) &= \prod_{i} \mathcal{L}(\theta_i, data_i)
\end{align}
where $\theta_i$ is the subset of the parameters $\theta$ used by the model fitted to $data_i$; a minimal numerical sketch of such a fit follows below.
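The following sketch minimises the sum of two NLLs as in \ref{eq:loglikelihood_from_sums_weighted} and \ref{eq:simultaneous_likelihood}, with two Gaussian models sharing the location parameter $\mu$ (again a hypothetical, self-contained example with \texttt{numpy} and \texttt{scipy}; \zfit{} builds the equivalent loss from its models and data)
\begin{center}
\begin{minipage}{\textwidth}
\begin{python}
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

# two independent datasets with per-event weights, sharing mu
data1, w1 = np.random.normal(5.0, 1.0, 1000), np.ones(1000)
data2, w2 = np.random.normal(5.0, 2.0, 500), np.ones(500)

def nll(params):
    mu, sigma1, sigma2 = params
    # weighted NLL per dataset; the product of the likelihoods
    # becomes a sum of negative log likelihoods
    nll1 = -np.sum(w1 * norm.logpdf(data1, loc=mu, scale=sigma1))
    nll2 = -np.sum(w2 * norm.logpdf(data2, loc=mu, scale=sigma2))
    return nll1 + nll2

result = minimize(nll, x0=[4.5, 1.5, 1.5])  # numerical minimisation
print(result.x)  # estimates of mu, sigma1, sigma2
\end{python}
\end{minipage}
\end{center}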
The likelihood can be multiplied by additional terms, most notably an extended maximum likelihood (EML) term and parameter constraints. If a parameter changes both the shape and the overall normalisation of the pdf, an EML fit is superior. This situation typically arises when several shapes are summed, as when adding the background and the signal shape. Taking for example the model $N_{sig} \cdot PDF_{sig} + N_{bkg} \cdot PDF_{bkg}$ and assuming a Poisson distribution of the number of events in the data, we can multiply the likelihood by a term
\begin{align} \label{eq:extended_likelihood}
\mathcal{L}_{extended} &= \operatorname{Pois}(N_{data}; N_{tot}) \\
&= \frac{N_{tot}^{N_{data}} e^{- N_{tot}}}{N_{data}!}
\end{align}
where $N_{tot} = N_{sig} + N_{bkg}$ is the expected and $N_{data}$ the observed number of events. Therefore the EML estimation looks like
\begin{align} \label{eq:loglikelihood_from_sums_weighted_extended}
\hat{\theta} = \operatorname{argmin} \left(- \sum_{i} \ln(f_i(\theta|data_i)) - \sum_{i} \ln(\operatorname{Pois}(N_{data_i}; N_i))\right)
\end{align}
where $data_i$ are, as before, different datasets. The weights are contained in $data_i$ and taken into account as described above; the total number of events is, in general, the sum of the weights of all events.

For certain parameters, prior knowledge is available. If the uncertainty of this knowledge is of the same order of magnitude as the expected fit sensitivity, a constraint term can be used to incorporate it. This is simply an additional term that the likelihood is multiplied by
\begin{align} \label{eq:constraints}
\mathcal{L}(\theta) &= \mathcal{L}_{unconstrained} \prod_{i} f_{constr_i}(\theta) \\
&= \mathcal{L}_{unconstrained} \mathcal{L}_{constr}.
\end{align}
As an example, the constraint for a parameter $\theta_i$ that is Gaussian-constrained with mean $\mu_{constr}$ and width $\sigma_{constr}$ reads
\begin{align}
f_{constr_i}(\theta) = Gauss(\theta_i; \mu_{constr}, \sigma_{constr}).
\end{align}
Combining equations \ref{eq:simultaneous_likelihood}, \ref{eq:extended_likelihood} and \ref{eq:constraints} yields the general form of the likelihood
\begin{equation} \label{eq:total_likelihood}
\mathcal{L} = \mathcal{L}_{f(x)} \cdot \mathcal{L}_{extended} \cdot \mathcal{L}_{constr}.
\end{equation}
The absolute magnitude of the likelihood itself does not have any specific meaning, but ratios and scans over the parameter space of $\theta$ allow for statistical interpretation. This, however, goes beyond the scope of \zfit{} and is intentionally left to other packages like lauztat\cite{software:lauztat}. As mentioned in the beginning, a likelihood as described in \ref{eq:total_likelihood} is not the only possibility to define a loss, though it is the most commonly used one. A prominent example is the $\chi^2$ loss, which is equivalent to a likelihood in the limiting case where each point is normally distributed. A schematic construction of the combined NLL of \ref{eq:total_likelihood} is sketched below.
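The following sketch shows schematically how the pieces of \ref{eq:total_likelihood} combine into a single NLL for a signal-plus-background model (a hypothetical example with \texttt{numpy} and \texttt{scipy}; names, shapes and constraint values are illustrative and not the \zfit{} API)
\begin{center}
\begin{minipage}{\textwidth}
\begin{python}
import numpy as np
from scipy.optimize import minimize
from scipy.stats import expon, norm, poisson

# toy dataset: Gaussian signal plus exponential background
data = np.concatenate([np.random.normal(5.0, 0.5, 800),
                       np.random.exponential(5.0, 200)])

def total_nll(params):
    mu, sigma, tau, n_sig, n_bkg = params
    n_tot = n_sig + n_bkg
    # shape part: yield-weighted mixture of normalised pdfs
    pdf = (n_sig * norm.pdf(data, mu, sigma)
           + n_bkg * expon.pdf(data, scale=tau)) / n_tot
    nll = -np.sum(np.log(pdf))
    # extended term: Poisson probability of the observed count
    nll -= poisson.logpmf(len(data), n_tot)
    # Gaussian constraint on sigma from prior knowledge
    nll -= norm.logpdf(sigma, loc=0.5, scale=0.05)
    return nll

fit = minimize(total_nll, x0=[5.0, 0.5, 5.0, 700.0, 300.0])
\end{python}
\end{minipage}
\end{center}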
\section{Backend}
Choosing the right computational backend for model fitting is crucial. In the following sections, the different paradigms and backend designs are introduced. Furthermore, the TF library and the way it is wrapped inside \zfit{} are explained.

\subsection{HPC and paradigms} \label{appendix:paradigms}
High performance computing (HPC) solves an entirely different problem than high-level programming languages do. While the flexibility offered by Python is nearly unlimited and an incredibly strong and convenient feature, it is only possible due to the dynamic nature of the language. The interpreter, which actually translates and executes the code on the machine level, can make nearly no assumptions about any object: neither about its size or type nor about its future use in the code. This means that no optimisation ahead of the computation is possible. For HPC this is the crucial key to success: the more that is known \textit{in advance} about the actual computation, the better the performance will be. The less flexibility is available, the more assumptions can be made. This way the layout of arrays, the parallelisation of computations, caching and more can be achieved efficiently. Since less flexibility is not a desirable characteristic overall, a good compromise has to be found. The distinction between two fundamental paradigms is of importance
\begin{description}
\item[Imperative] Most code is run imperatively: a statement is executed when the line is hit. Python, C++ and many more languages work that way. The advantage is that the state changes as the code runs. An addition of two numbers, for example, is executed right when the command appears. The disadvantage is that it is impossible to optimise over more than one step, since the next line of code is not known yet.
\item[Declarative/Graph] When a statement is supposed to be executed, it is \textit{actually} not yet run but recorded. It runs either explicitly when asked for or implicitly if it is needed as a dependency. The latter could also be described as ``lazy evaluation". The disadvantage is that the state of the code may look undetermined, since for example an addition of two numbers is not yet executed when the line is hit. Instead, an object representing the computation is returned. The paradigm can also be described as a ``graph" based approach, since an execution graph is built. In the special case of HPC, this can be a computational graph describing data inputs, operations and outputs. While a declarative paradigm is better in terms of raw performance and optimisations in general, it is also more limited in functionality and cannot offer the flexibility that the imperative paradigm offers. The workflow is divided into \textit{compilation} or \textit{graph construction} time and \textit{run} time.
\end{description}
In reality, a mixture of these approaches is usually applied on different levels, with some languages and frameworks tending strongly toward one or the other. For example, any ahead-of-time compilation as used by a C/C++ compiler is essentially a declarative step, while the C/C++ language itself is imperative. Deep learning frameworks show a strong tendency towards the declarative paradigm; more specifically, a graph is built. To ease debugging and make experimentation simpler, the immediate execution of a graph can make a graph-based approach behave imperatively, and some frameworks therefore offer such a mixed mode. For maximal performance, though, a graph-based approach remains superior for reasonably complex problems. There are three major advantages of using a graph.
\begin{description}
\item[Distributed computing] Moore's law has stated for over 40 years that the processing power of CPUs, which originally corresponded to the clock speed, doubles approximately every 1.5 years. While there is no theoretical foundation for this statement, reality has held up to it. Since the beginning of the 2000s, CPUs have reached clock speeds around the gigahertz scale and started hitting a limit: the thermal dissipation of a CPU grows with the third power of the clock speed. As a consequence, most CPUs today do not run at higher clock rates in order to stay efficient. Nonetheless, CPU power keeps increasing at a similar rate by using more than one CPU unit in parallel; multicore systems have become common on almost every machine. It is therefore imperative that any numerically intensive library makes good use of distributed computing.
This ranges from shared-memory machines to large-scale clusters with multiple nodes. Moreover, with the recent success of ever-growing deep neural networks (DNNs) and frameworks built to make easy use of them, vectorised computing devices like GPUs can easily be used as an alternative to CPUs.
\item[Mathematical optimisation] Since not only a single step but the whole computation is available at once, possibilities for mathematical optimisation arise. Most notably, an analytic expression for the derivative is available through automatic differentiation, which successively applies the chain rule to every operation; a short sketch is shown in Section~\ref{appendix:tensorflow}.
\item[Caching] A thorough analysis of the computation graph allows for various computational optimisations. Common subexpression elimination identifies whether the same operation is executed in several places and substitutes the nodes with a single computation.
\end{description}
The disadvantages of building a graph first are the overhead to build, analyse and store the graph, and the additional indirection, since the main principle is to build the computation graph once and run it several times. This implies non-intuitive behaviour for users coming from an imperative background. While the graph-based approach is very capable, there are two fundamentally different assumptions that can be made about graphs.
\begin{description}
\item[Static] A static graph is built once and \textit{cannot} be altered. If we have, for example, a complicated function taking a uniform random number generator as input, we build the graph once and execute it several times. To change the input to a normally distributed random generator, the whole graph has to be built again with one operation, the random input generator, exchanged.
\item[Dynamic] Dynamic graphs are mutable and can be used if there are many actual modifications to the graph. Any part can be changed; it is not restricted to append-only operations.
\end{description}
As an example, TensorFlow offers a static graph while PyTorch uses a dynamic graph.

\subsection{Working with TensorFlow} \label{appendix:tensorflow}
Graph-based approaches imply another level of indirection and need some wrapping in order to avoid unexpected or inefficient behaviour. In \zfit{} this is mostly hidden, exposing some of it but removing the obstacles for most use cases. As an example, we will look at the task of generating random numbers several times in a row. In the imperative style this looks as follows, where numpy is used to generate a uniform sample
\begin{center}
\begin{minipage}{\textwidth}
\begin{python}
for _ in range(n_samples):
    sample = np.random.uniform(...)
    # do something with the sample
\end{python}
\end{minipage}
\end{center}
In a declarative approach as with TF, the same is achieved by \textit{first} building the actual operation of the graph and \textit{then} executing the computation
\begin{center}
\begin{minipage}{\textwidth}
\begin{python}
sample_op = tf.random.uniform(...)
for _ in range(n_samples):
    sample = zfit.run(sample_op)
    # do something with the sample
\end{python}
\end{minipage}
\end{center}
where \pyth{zfit.run(...)} is just the command to execute a certain part of the graph. The object \pyth{sample} is in both cases a numpy array. Note that in the declarative approach, we created the operation only once but \textit{execute} it multiple times.
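The same build-once, run-many pattern enables the automatic differentiation mentioned in Section~\ref{appendix:paradigms}: since the whole computation is known as a graph, the derivative can be constructed as just another graph operation. A minimal sketch in plain TF 1.x style, using \pyth{tf.Session} directly instead of the \pyth{zfit.run} wrapper
\begin{center}
\begin{minipage}{\textwidth}
\begin{python}
import tensorflow as tf

x = tf.Variable(3.0)
y = x ** 2  # build the graph for y = x^2
grad_op = tf.gradients(y, [x])[0]  # graph node computing dy/dx

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    print(sess.run(grad_op))  # 6.0, evaluated only at run time
\end{python}
\end{minipage}
\end{center}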
This indirection can be surprising for the unaware, and a typical mistake is to implement it this way
\begin{center}
\begin{minipage}{\textwidth}
\begin{python}
# BAD EXAMPLE
for _ in range(n_samples):
    sample_op = tf.random.uniform(...)
    sample = zfit.run(sample_op)
    # do something with the sample
\end{python}
\end{minipage}
\end{center}
which just fills up the graph with additional operations that all do the same thing. In order to hide this inconvenience from the user, a caching system is in place in \zfit{} to prevent an accidental re-creation of operations. In the above example, we can think of wrapping a function \pyth{uniform} (\textit{fictive example}) that creates the operation on the first call and returns the cached operation afterwards. In pseudo code, this looks like the following
\begin{center}
\begin{minipage}{\textwidth}
\begin{python}
cached_op = None

def uniform(...):
    global cached_op  # make "cached_op" assignable
    if cached_op is None:
        cached_op = tf.random.uniform(...)  # create op only once
    return cached_op
\end{python}
\end{minipage}
\end{center}
This function can then be used in a mixed style
\begin{center}
\begin{minipage}{\textwidth}
\begin{python}
for _ in range(n_samples):
    sample_op = uniform(...)  # as created above
    sample = zfit.run(sample_op)
    # do something with the sample
\end{python}
\end{minipage}
\end{center}
This hides some of the difficulties. While this example of plain random number generation may seem artificial, a prominent example is the loss, which is typically built from models and data. Calling \pyth{value} builds the operation, adds it to the graph and stores it, so that in subsequent calls the stored operation is returned instead of a new one being created. While this is very convenient for the user, combining the best of both worlds, it comes with the additional burden for \zfit{} of having to cache the operations efficiently.

\subsubsection{Caching} \label{appendix:caching}
In HPC, certain parts of a computation sometimes do not need to be recomputed, and storing the results in a cache for later reuse can improve the performance significantly. Since \zfit{} uses a declarative, graph-based approach, there are two different kinds of caches: one for the operations built at construction time, as discussed above, which appear only due to the graph-based approach; and one for intermediate calculations in general, which corresponds to objects built at run time. The latter is also commonly used in imperative approaches.

As seen before, creating operations in a declarative approach adds each of them to a global graph, a collection of operations. Since these are only instructions and not actual computations, there is no need to rebuild the operations and stitch them together again; the previously built instructions can be reused. Therefore \zfit{} provides the possibility to cache Tensors after they have been built for the first time, inside the object the Tensor was created in. Since the graph is static and cannot change, even a small modification to a part of it requires a complete rebuild\footnote{There are ways of changing the \textit{value} but not the logic, the computation flow.}. This can be as little as removing a single, additional term in the model. Any object that may need to perform such a modification is therefore registered with the caching object and can notify the cacher in order to invalidate its cache and rebuild any Tensor; the pattern is sketched below.
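Schematically, such a register-and-notify mechanism could look like the following (a simplified, hypothetical sketch; class and method names are illustrative and not the actual \zfit{} implementation)
\begin{center}
\begin{minipage}{\textwidth}
\begin{python}
class Cacher:
    """Holds built operations; registered objects can invalidate it."""

    def __init__(self):
        self._cached_ops = {}

    def get_or_build(self, key, build_func):
        # return the stored operation, building it only on the first call
        if key not in self._cached_ops:
            self._cached_ops[key] = build_func()
        return self._cached_ops[key]

    def invalidate(self):
        # called by a registered object whose structure changed;
        # forces a rebuild of the Tensors on the next access
        self._cached_ops.clear()


class Model:
    def __init__(self, cacher):
        self.cacher = cacher  # the model registers itself with the cacher

    def remove_term(self, term):
        # a structural change to the model: the old graph is invalid
        self.cacher.invalidate()
\end{python}
\end{minipage}
\end{center}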
Numerical results of computations inside the graph can remain the same over several executions of an operation and can therefore be cached. As opposed to the construction-time caching, numerical caches are also invalidated if a number inside the graph changes, not only if its structure does. Caching in a declarative approach is not straightforward, since it implies replacing a node, which can be a large subgraph by itself, with a single value. As the graph cannot be changed, this would require a complete rebuild of the graph, rendering the caching far less efficient. There are two ways to circumvent this problem
\begin{description}
\item[feed\_dict] The \pyth{run} method of the TF session offers the possibility to overwrite specific nodes with a value using a feed\_dict. This means that cached values can simply be stored and overridden at run time. At the current stage, \zfit{} is not designed to always control the execution but leaves the freedom to the user to invoke the \pyth{Session.run} method themselves.
\item[Variable] Since all TF objects so far are just operations and not numbers, storing a value between runs requires a special object. A TF \pyth{Variable} can be used for this case. Its purpose is to store a value but also to act as a node in the graph with a read operation that returns the current value. Its value can be changed between runs in case it has to be updated. This, though, can have side effects on other objects that still use this cache. The disadvantage here is that the initialisation of the Variable can be tricky if it is created inside a control flow such as \pyth{tf.while} or \pyth{tf.cond}.
\end{description}
In \zfit{}, a feed\_dict independent implementation is currently being tested, but this is likely to change in the future. One main problem with this kind of caching is the gradient. If a value is not supposed to change, the case is simple, since there is no gradient. But if there is a value that \textit{can} change, the gradient's value also needs to be cached. A likely solution is to wrap the ordinary Tensor class from TF and provide a caching Tensor that automatically takes care of gradient caching as well. The feed\_dict also seems like an efficient solution, at least for cases without a gradient, as illustrated below.
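As an illustration of the feed\_dict approach, consider the following minimal sketch (TF 1.x; the tensor names are hypothetical). Any feedable tensor of the graph can be overridden at run time, so the value of an expensive subgraph can be computed once and fed back in later runs
\begin{center}
\begin{minipage}{\textwidth}
\begin{python}
import tensorflow as tf

expensive = tf.reduce_sum(tf.random.uniform(shape=(1000,)))  # costly subgraph
result = expensive * 2.0 + 1.0

with tf.Session() as sess:
    cached_value = sess.run(expensive)  # compute once, keep the number
    # later runs: feed the cached value instead of recomputing the subgraph
    print(sess.run(result, feed_dict={expensive: cached_value}))
\end{python}
\end{minipage}
\end{center}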