\section{Implementation} \subsection{Spaces definition} \label{appendix:spaces defined} A \zspace{} is initialized either through observables \textit{or} through axes and may also carry limits. It \textit{can} hold both observables and axes, in which case an order-based, bijective mapping between the two is defined. In general, a \zspace{} is immutable: adding or changing anything always returns a copy. When a user creates a \zspace{}, observables are used, which defines the coordinate system. Once a dimensional object, such as a model or data, is created with a \zspace{}, the order of the observables matters. Since the \zspace{} at this point only has observables and no axes yet, the axes are created when the dimensional object is instantiated by filling up from zero to $n_{obs} - 1$. This step is crucial: it defines the mapping of the internal axes of this dimensional object to the externally used observables. In other words, every dimensional object has \textit{implicitly} defined axes obtained by counting up to $n_{obs} - 1$; assigning observables creates the mapping by enumerating them. As an example, assume a model was instantiated with a \zspace{} consisting of some observables and data with the same observables but in a different order. The assignment of observables to the model and to the data columns is now fixed, and it is therefore well defined how the data has to be reordered when it is passed to the model. While data and a model were used in the example above, the same holds for limits, which can be used to specify the bounds of integration and more. Since limits can be part of a \zspace{}, they are reordered automatically if the order of the observables or axes is changed, not in-place but in the returned copy. To help with the accounting of dimensions, a \zspace{} can return a copy of itself with differently ordered observables. Internally, the axes, if given, and the limits, if given, are reordered as well. This is crucial for the input preprocessing of any dimensional object, since it ensures that each \zspace{} is ordered in the same way as the observables of that object. A subspace, containing only a subset of the dimensions, can be created from a \zspace{}. It can be used, for example, if a model is composed of lower dimensional models, which is often the case for functors such as a product PDF as described in Sec. \ref{model:functors}.
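To illustrate this behaviour, the following minimal sketch creates a two-dimensional \zspace{}, reorders it and extracts a subspace; it only uses the \pyth{with_obs} method, which is shown again in the next subsection.
\begin{center} \begin{minipage}{\textwidth} \begin{python}
import zfit

# two observables; their order defines the coordinate system
space_xy = zfit.Space(obs=["x", "y"])

# reordering returns a copy and leaves space_xy untouched
space_yx = space_xy.with_obs(["y", "x"])  # observables are now ordered as ("y", "x")

# a subspace containing only a subset of the dimensions
space_x = space_xy.with_obs("x")
\end{python} \end{minipage} \end{center}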
\subsection{General limits} \label{appendix:general limits} Simple limits are just tuples. A more general format is needed, though, to express multiple limits and higher dimensions in a straightforward way. Multiple limits are technically expressed as multiple tuples of lower and upper limits for each observable. The \zspace{} is handled as \textit{one} domain: the area of a \zspace{} is the sum of the areas of each simple limit, the integral over a \zspace{} is the sum of the integrals over each simple limit, and so on. Multiple limits are defined separately and are not built from the projections of all limits. The format is therefore to specify the lower limits and the upper limits in each dimension as a tuple, where multiple limits contain multiple lower limit tuples and multiple upper limit tuples.
As an example, a \zspace{} is created with the observables \texttt{x, y} and the two limits $l_1$ and $l_2$ \begin{align*} l_1 &= (x_0, y_0) ~\text{to}~ (x_1, y_1) \\ l_2 &= (x_2, y_2) ~\text{to}~ (x_3, y_3) \end{align*} To write the limits in the right format, the lower and upper limits have to be separated and concatenated, so that they look like this \begin{align*} lower &= ((x_0, y_0), (x_2, y_2)) \\ upper &= ((x_1, y_1), (x_3, y_3)) \end{align*} By definition of the format, $lower$ and $upper$ have to have the same length. The limits for the \zspace{} are the tuple $(lower, upper)$. Creating this \zspace{} with \zfit{} is done as follows \newline \begin{center} \begin{minipage}{\textwidth} \begin{python}
lower = ((x0, y0), (x2, y2))
upper = ((x1, y1), (x3, y3))
limits = (lower, upper)
multiple_limits = zfit.Space(obs=["x", "y"], limits=limits)
\end{python} \end{minipage} \end{center} The same format is returned by the property \pyth{Space.limits}. This is a quite general format that covers the needs for rectangular limits. More advanced shapes, however, may be necessary (see also \ref{sec:dalitz}) and will most probably be provided in the future. Since the order of the observables matters and \pyth{limits_xy} and \pyth{limits_yx}, as used in Sec. \ref{sec:spaces and dimensions}, define the same domain (apart from the order of the axes), they can be converted into each other using the \pyth{Space.with_obs} method \begin{center} \begin{minipage}{\textwidth} \begin{python}
limits_xy_resorted = limits_yx.with_obs(limits_xy.obs)
limits_xy == limits_xy_resorted  # -> True
\end{python} \end{minipage} \end{center} Notice that this creates a new \zspace{} and leaves \pyth{limits_yx} untouched. The method can also be used to select only a subspace \begin{center} \begin{minipage}{\textwidth} \centering \begin{python}
limits_x == limits_xy.with_obs("x")  # -> True
\end{python} \end{minipage} \end{center} and therefore go back to lower dimensions. \subsection{Data formats} \label{appendix:data formats} Currently, the following formats can be read by \zdata. \begin{description} \item[ROOT] The standard file format used in HEP analyses. It efficiently stores data samples and is the native format of the \root library. Thanks to the recent development of \texttt{uproot}\cite{software:uproot}, it is possible to load these files in Python \textit{without} an installation of \root. \item[Pandas DataFrame] Pandas\cite{mckinney-proc-scipy-2010} is the most extensive data-analysis container in Python. It provides DataFrames, which offer an extensive set of data analysis tools ranging from plotting to feature creation and selections. It is the de-facto standard in Python and can load a variety of data formats, including hdf5, csv and more. The possibility to load DataFrames directly into \zfit{} is therefore a powerful feature, because it allows any preprocessing to be done in DataFrames. \zdata{} can also be converted \textit{to} DataFrames, which makes it possible, for example, to load a ROOT file into \zdata{}, convert it to a DataFrame, apply some preprocessing steps and load it again into \zdata{} (see the sketch after this list). \item[Numpy] Numpy\cite{software:numpy} is the standard computing library in Python, which has been around for a long time. Several libraries, including TF, are inspired by its API and behaviour design. Numpy arrays are the default way to handle any vectorized data and are also returned by TF as the result of computations. \item[Tensors] \zdata{} can also take a pure Tensor as input.
While this may seem at first glance the obvious thing to do, it is trickier: a Tensor, in contrast to the other data types, is \textit{not} fixed per se, since it is only an instruction to compute a certain quantity. While a constant, for example, behaves straightforwardly and always returns the same value, a \zdata{} initialized with a random Tensor produces different data every time it is called. Therefore, special care has to be taken in this case, on the developer as well as on the user side. \end{description}
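As a brief sketch of the typical loading pattern (the observable name and the toy data are chosen for illustration only and the exact constructor signatures may differ between \zfit{} versions), data can for instance be created from Numpy arrays or Pandas DataFrames and converted back:
\begin{center} \begin{minipage}{\textwidth} \begin{python}
import numpy as np
import pandas as pd
import zfit

obs = zfit.Space(obs="x")

# from a numpy array with one column per observable
array = np.random.normal(size=(1000, 1))
data_np = zfit.Data.from_numpy(obs=obs, array=array)

# from a pandas DataFrame, e.g. after some preprocessing ...
df = pd.DataFrame({"x": np.random.normal(size=1000)})
data_df = zfit.Data.from_pandas(df=df, obs=obs)

# ... and back to a DataFrame for further preprocessing
df_again = data_df.to_pandas()
\end{python} \end{minipage} \end{center}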
\subsection{Data batching} \label{appendix:data batching} Small datasets are internally simply converted to a Tensor and attached to the graph. Large datasets though, which exceed either the memory limit of the computing device or the limit of the graph size (which is 2 GB), need to make use of batched, out-of-core computation. TF has a data handling class \pyth{Dataset}, which provides a performant way to do batched computations. It incorporates the loading of the batches from disk into the graph as several operations. This allows the runtime to split the execution in order to asynchronously load a batch of the data while running the graph on an already loaded batch. Another way of on-the-fly computation can, more generally, be achieved with Tensors, since they can be used to instantiate \zdata{}. For example, \zdata{} has a subclass \texttt{Sampler}, which is specialised in this. It allows to evaluate the Tensor and store its value, so that the Tensor is only re-evaluated when requested. The \texttt{Sampler} acts, for example, as the \zdata{} returned when sampling from a model. This data depends on the model but can be used like normal data, for example to construct a loss. \subsection{Dependency management} \label{appendix:dependency management} The graph built by TF can be fully accessed: the parent operations of any operation are available. This makes it possible to detect any dependency by walking along the graph, which is what TF does internally to create the gradient. Using this in general can be risky though, since for example caching with a \pyth{Variable} changes how the \textit{actual} graph looks. Furthermore, it is also time consuming on a larger scale. Inside \zfit{}, it is used as an additional feature to automatically figure out the dependencies of certain subgraphs. There is one type of independents that can change, as also described in \ref{sec:parametrization}, the \texttt{Parameter}, that other objects can depend on, directly or indirectly. To have a dependency structure that is independent of the graph, \zfit{} has a \pyth{BaseDependentsMixin}. A subclass implements a method returning the dependents for itself. This can be done recursively up to the independent parameters (see also \ref{sec:parametrization}), which return themselves. Any major base class implements the appropriate functions but requires, for example, a distinction between a model that depends only on parameters and one that also depends on other models. Both of them are \texttt{Dependents}, but the model has to be aware that it needs to extract dependents not only from the parameters but also from the models. To get the dependents from any object, \pyth{get_dependents} can be used, which returns a set of independent \texttt{Parameters}. This is fundamentally different from the \pyth{params} that each model has. The former returns all the \textit{independent} parameters that the model depends on, which can be any number of them. The latter returns the exact parameters that are used in the function defining the shape of the model. For fitting, when tuning the parameters, \pyth{get_dependents} should be used, since the actually changeable parameters matter. When reading off a value from a model, like the mean by accessing \texttt{mu}, \pyth{params} has to be used. Using the latter for fitting can easily result in an error if not all of the parameters are independent, since the value of dependent (including constant) parameters \textit{cannot be set}. \subsection{Base Model} \label{appendix:basemodel} Of all the classes, the \texttt{BaseModel}, and therefore also its subclasses \texttt{BaseFunc} and \texttt{BasePDF}, contains the most logic. The implementation has a few peculiarities that will be highlighted here. It is meant to be used as a base to implement custom models, providing flexibility and ease of use at the same time, as discussed in Sec. \ref{sec:concepts}. Therefore, this class provides a structure where everything can be directly controlled \textit{but does not have to be}: it takes care of anything that is \textit{unambiguous} but maybe cumbersome to do, while leaving full control to the user. The intended usage as a base class leaves it to users to implement themselves any method they want to control directly. Namely, the class guarantees that any of the main methods can be overridden by changing the implementation of \texttt{\_method}, the same method name but with a leading underscore. Furthermore, any direct control can always be given back to the class by simply raising a \texttt{NotImplementedError}; the class then acts as if the overridden method was never called. In addition, the base class takes care of some unambiguous handling of arguments (see below, e.g. \ref{sec:norm_range_handling}, \ref{sec:multiple_limits_handling}). It is mandatory to decorate each \texttt{\_method} with the \texttt{supports()} decorator. This allows to specify whether the method can handle certain things, like normalization ranges. By default, the more advanced arguments are filtered out and handled automatically; the decorator is meant to provide a way to still allow specific implementations and workarounds.
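To make this concrete, a minimal sketch of a custom PDF that overrides \pyth{_unnormalized_pdf} is shown below. It is meant only as an illustration of the pattern: the constructor arguments and the \pyth{unstack_x} call reflect the typical API but are assumptions that may differ in detail between versions.
\begin{center} \begin{minipage}{\textwidth} \begin{python}
import tensorflow as tf
import zfit


class MyGauss(zfit.pdf.BasePDF):
    """Gaussian shape; normalization and integration are left to the base class."""

    def __init__(self, mu, sigma, obs, name="MyGauss"):
        params = {"mu": mu, "sigma": sigma}
        super().__init__(obs=obs, params=params, name=name)

    def _unnormalized_pdf(self, x):
        x = x.unstack_x()  # assumed helper: returns the single data column as a Tensor
        mu = self.params["mu"]
        sigma = self.params["sigma"]
        return tf.exp(-0.5 * tf.square((x - mu) / sigma))
\end{python} \end{minipage} \end{center}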
\subsubsection{Public methods} \label{sec:public_methods} The internal logic of the methods \pyth{pdf} (and, to some extent, \pyth{func} and \pyth{unnormalized_pdf}), \pyth{sample} and all the \pyth{integrate} methods has a very similar layout. It is split into a public part and more internal methods. We start with the outermost, public methods and follow the subsequent method calls. There is a strict order on which method is called after which, and we follow the same order here. A public method is a method starting without an underscore. It serves as the entry point to a model, asking it to do something. It is supposed to provide a clean API to the user as well as appropriate documentation. The functional responsibility of the public method is to clean the input, which mostly means taking care of the ordering of dimensions and automatically converting certain inputs to a more general format. The following cleaning is done to the input \begin{description} \item[limits] Limits for norm\_range or for sampling and integration can be given as arguments to certain methods. First, the limits are automatically converted to a \zspace{} \textit{if it is safe to do so}: namely, a simple limit consisting of a tuple in the one-dimensional case is converted; anything else raises an error. Second, the method takes care of sorting the space in the right way. Since each model has a \zspace{} with observables assigned to it in a given order, the given or auto-converted \zspace{} is sorted accordingly. This ensures that the limits are internally in the right order. \item[data] As for the limits, the data is first converted to the right format, a \zdata{}, \textit{if possible}. This converts any input data that coincides with the dimensionality of the model. While this is convenient, since it allows numpy arrays and Tensors to be fed directly to a model, it relies on the correct order of both the data and the model observables. Since this can silently lead to mistakes through a wrong order, it will probably not be possible anymore in the future, given that a simple conversion to \zdata{} can always be made. As a second step, the \zdata{} is sorted according to the observables of the model, just like the limits. This ordering is done using context managers that revert the reordering once the data exits the method. \end{description} The ordering is a crucial element to allow the direct usage of any object inside a model while guaranteeing a matching order. \subsubsection{Hooks} After the public method, two hooks follow. Assuming a public method named \texttt{method}, the first hook is named \texttt{\_single\_hook\_method}. Its purpose is to be called directly from the public method \textit{or} whenever the model itself needs to call one of its own methods. The second hook, called by the first, is \texttt{\_hook\_method} and is used for repeated calls by the very same method that was called. This is useful if, for example, a scaling factor is supposed to be applied at the end of the calculations: if a method calls itself recursively, the scaling should be applied just once at the very end. The general idea of hooks is to provide a convenient way to change the behaviour of a method \textit{without} altering the public method and its input cleaning. It gives the user the possibility to change any input right before it is passed further down or any output right before it is returned. Any kind of advanced, model-specific pre- or post-processing or setting of internal flags can therefore be handled here. Some implementations inside \zfit{} make use of this. For example, the implementation of the exponential shape uses the exponential-shift trick\footnote{A normalized exponential shape is invariant under translation. This can be used to increase the numerical stability by avoiding calculations with large numbers and keeping the values around zero.}, which requires determining the shift before the actual computation method is called. Whatever modification is applied in a hook, it is important that any hook always calls its parent method to allow stacking hooks. \subsubsection{Norm range handling} \label{sec:norm_range_handling} Following the hooks, \texttt{\_norm\_method} is invoked. It only exists if a normalization range is used inside the function. This method's responsibility is to automatically take care of the \textit{normalization logic} if the underlying function does not handle it itself. A \texttt{NormRangeNotImplementedError} can be raised by any deeper nested function and is caught here.
For example, in the case of an integral, the method first calls the subsequent method and, if it catches an error, splits the computation into two calls: it calculates two integrals, each \textit{without} a normalization range, one over the limits to be integrated over and one over the normalization range. The first is then divided by the second, which is the very definition of a normalized integral. To illustrate this behaviour in pseudo code, a function \pyth{integrate} that takes the limits and the normalization range as arguments is assumed to exist. \pyth{False} as argument for the normalization range means that the calculated integral is unnormalized. \begin{center} \begin{minipage}{\textwidth} \begin{python}
try:
    integral = integrate(limits, norm_range)  # pseudo method that integrates
except NormRangeNotImplementedError:
    integral_unnormalized = integrate(limits, False)
    normalization = integrate(norm_range, False)  # integrate over norm_range
    integral = integral_unnormalized / normalization
\end{python} \end{minipage} \end{center} This allows a user to implement only a simple integral, when overwriting or registering, without the need to worry about its normalization. For most integrals this would anyway end up calling the integration function twice, basically what is done above. The default behaviour is that the normalization range is handled automatically, as described in Appendix \ref{appendix:basemodel}. But it still leaves room for implementing a method that handles the integral as well as its normalization. There are special cases where this can be achieved with fewer computations than calculating the integral twice, such as when the normalization range equals the limits. \subsubsection{Multiple limits handling} \label{sec:multiple_limits_handling} Next in the call sequence, the method \pyth{_limits_method} is invoked. It has the responsibility of handling multiple limits, which are described in \ref{sec:multiple limits}. Analogously to the way norm\_range is handled in \ref{sec:norm_range_handling}, multiple limits are caught here as well. Subsequent methods are called inside a \texttt{try-except} block in order to catch any \texttt{MultipleLimitsNotImplementedError}. If handling multiple limits is well defined, \pyth{_limits_method} is supposed to take care of it. As an example, the implementation of the integration over a \zspace{} with \textit{n} limits is to split it into \textit{n} independent \zspace{}s, each with only one limit, and to call the following methods with these new \zspace{}s. It is then a simple matter of summing the results. Given some multiple limits as a \zspace{} to be integrated over by invoking the pseudo \pyth{integrate} function, the implementation looks like this \begin{center} \begin{minipage}{\textwidth} \begin{python}
try:
    integral = integrate(limits)
except MultipleLimitsNotImplementedError:
    integral = 0
    for limit in limits.iter_limits():
        integral += integrate(limit)
\end{python} \end{minipage} \end{center} \subsubsection{Most efficient method} In \pyth{_call_method}, the actual functions are invoked. There are usually several choices of which function to invoke, depending on availability and efficiency. The order therefore starts with the most efficient implementation; if that raises a \pyth{NotImplementedError}, the next method is called. \begin{itemize} \item First, the \pyth{_method} is called and its result returned if successful.
This is a method that by default raises a \pyth{NotImplementedError} but can be overwritten by the user. In that case it is guaranteed to be executed, making the public \pyth{method} call seem like a call to \pyth{_method}, modulo input cleaning and limit handling. This guarantees the full level of flexibility in a model: any major function can be completely overwritten by this procedure. \item If no explicit method is implemented, closely related alternatives are invoked. For example, if \pyth{log_pdf} was called and \pyth{_log_pdf} is not overwritten, \pyth{_pdf} is invoked and, if implemented, its logarithm is returned. As another example, if \pyth{integrate} was called and \pyth{_integrate} is not implemented, an analytic integral calculation is performed. If that is not available and also raises an error, more expensive alternatives are called. \item The freedom of \pyth{_call_method} is to try simple, but possibly failing, alternatives to return the desired result. When all of the above fails, the \pyth{_fallback_method} is invoked. This is the last resort and may be quite a complex function. The calculations it performs may be expensive, but the method is guaranteed to work: it must not fail if the \zmodel{} is implemented correctly. \end{itemize} \subsubsection{Functors} \label{model:functors} \label{appendix:functor} To implement a function that depends only on data, the normal base class described above is sufficient. But a model can also depend on other models. Creating compositions, such as the \pyth{SumPDF} from Sec. \ref{sec:quickstart}, requires an additional tweak regarding the dependencies: a functor does not only depend on its own parameters but also on the parameters of the sub-models associated with it. This is automatically taken care of by a functor base class. Compared to normal models, functors often do not need to define their own observables, since these are inferred from the sub-models. \subsection{Sampling techniques} \label{sec:sampling techniques} \label{appendix:sampling techniques} Implementing sampling from an arbitrary distribution is not a straightforward task. The generation of uniformly distributed numbers is comparably simple and is used as the basis for any more advanced sampling technique. Two major approaches are commonly used to sample from a distribution. If the inverse of the analytic integral function is known, it can be used to transform a uniform distribution into the desired distribution by treating the uniform values as samples on the $y$ axis; the inverse then returns values distributed according to the target distribution. This method is very fast and efficient since every drawn event is used. The downside is that it requires the inverse analytic integral, which is not available for most shapes, especially custom ones. For a model where only the shape and no integral is known, the accept-reject method can be used. Samples are randomly generated in the model domain and evaluated. For each event, a random number is drawn between 0 and the maximum of the target shape. If the value returned by the model is larger than the random number, the event is accepted; otherwise it is rejected. The technique is illustrated in Fig. \ref{fig:accept_reject}.
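A minimal, self-contained sketch of this plain accept-reject step, written with Numpy only and using a hypothetical vectorized \pyth{shape} function for the (unnormalized) model shape, could look as follows.
\begin{center} \begin{minipage}{\textwidth} \begin{python}
import numpy as np


def accept_reject_sample(shape, lower, upper, shape_max, n):
    """Sample n events from the unnormalized 1D shape on [lower, upper]."""
    accepted = []
    while len(accepted) < n:
        x = np.random.uniform(lower, upper, size=n)    # proposed events
        y = np.random.uniform(0., shape_max, size=n)   # random y values
        accepted.extend(x[shape(x) > y])               # accept if the shape lies above y
    return np.array(accepted[:n])
\end{python} \end{minipage} \end{center}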
\begin{figure}[tbp] \begin{subfigure}{.5\textwidth} \includegraphics[width=\textwidth]{figs/accept_reject.png} \caption{} \end{subfigure}% \begin{subfigure}{.5\textwidth} \includegraphics[width=\textwidth]{figs/accept_reject_importance.png} \caption{} \label{fig:imporance_accept_reject} \end{subfigure}% \caption{Visualization of the accept-reject method. Proposed events are randomly sampled in the valid range. In a), a uniformly sampled $y$ value is used to either accept or reject them; in b), a Gaussian-shaped $y$ is used. The black line is the true shape of the model. The orange line represents the distribution the $y$ values were drawn from. Blue values are accepted, red ones are rejected.} \label{fig:accept_reject} \end{figure} For peaky distributions, however, this can be very inefficient, since all red (rejected) values are lost. An increase in efficiency can be achieved by sampling from a distribution that follows the target shape more closely, so-called ``importance sampling'', as shown in Fig.~\ref{fig:accept_reject} on the right. This requires a few more ingredients, namely \begin{description} \item[target] The distribution we want to approximate. \item[probability] The target probability is the function value of the target shape at a given $x$. While called a probability, it does not have to be normalized to anything but only has to be proportional to a real probability. \item[proposal] The proposed sample consists of the events that will be either accepted or rejected. They are drawn from the proposal distribution. \item[weights] The weight is the probability (or a value proportional to it) of an event of the proposal sample being drawn from the proposal distribution. \item[rnd] A random number drawn uniformly between 0 and 1 to decide whether to accept or reject an event. \end{description} To approximate the target, a sample is drawn from the proposal. To accept or reject an event $i$, the following is checked \begin{equation} accept_i = prob_i > weight_i \cdot rnd_i \end{equation} This is only unbiased if $prob_i < weight_i$ for all $i$; otherwise the distribution is distorted, as shown in Fig.~\ref{fig:importance_biased}. \begin{figure}[tbp] \centering \includegraphics[width=0.5\textwidth]{figs/importance_gone_wrong.png} \caption{Importance sampling with a wrongly scaled weight. The sampled Gaussian distribution (blue) is cut off at the top and does not resemble the correct shape.} \label{fig:importance_biased} \end{figure} To be unbiased, the weights therefore have to be scaled up sufficiently. This can, however, lead to problems if the weight is significantly lower than the target probability, even if only at a single point: the required large rescaling renders the sampling of the rest of the target inefficient. It is therefore important that the proposal distribution matches the target reasonably well; most importantly, the maximum and the minimum of the ratio of the two distributions should be as close as possible. \subsection{Loss defined} \label{appendix:loss defined} The following parts are provided by a loss to enable minimisation; a short usage sketch is given at the end of this section. \begin{description} \item[value] The actual value of the loss is needed since it is the object to be minimised. The method \pyth{value} returns the Tensor that can be run with \pyth{zfit.run(...)}. This Tensor typically contains the heavy computations. \item[parameters] The loss depends on one or more floating parameters. They are associated with models and change their shape. As for a model, all dependents can be retrieved using \pyth{get_dependents}. Changing their values affects the value of the loss. \item[gradients] Calling \pyth{gradients} returns the gradients of the loss with respect to the parameters. Gradients provide a helpful tool for the minimisation algorithm. \end{description}
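To illustrate these three ingredients, the following minimal sketch builds an unbinned negative log-likelihood and queries its value, its dependents and its gradients. The toy model and data are chosen for illustration only, and the exact method signatures may differ between \zfit{} versions.
\begin{center} \begin{minipage}{\textwidth} \begin{python}
import numpy as np
import zfit

obs = zfit.Space(obs="x", limits=(-5, 5))
mu = zfit.Parameter("mu", 0.2)
sigma = zfit.Parameter("sigma", 1.1)
gauss = zfit.pdf.Gauss(mu=mu, sigma=sigma, obs=obs)
data = zfit.Data.from_numpy(obs=obs, array=np.random.normal(0., 1., size=(1000, 1)))

nll = zfit.loss.UnbinnedNLL(model=gauss, data=data)

value = zfit.run(nll.value())     # the Tensor holding the heavy computation
params = nll.get_dependents()     # the set of independent Parameters
gradients = zfit.run(nll.gradients(list(params)))
\end{python} \end{minipage} \end{center}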