\subsection{Model} \label{sec:model} Building models is the core competence of \zfit{}. As seen in the example in Sec. \ref{sec:quickstart}, this can be done in a simple manner by using already implemented models and possibly combining them, but models can also be completely custom built. Within \zfit{}, there are two basic types of models to cover most cases: Functions and Probability Density Functions (PDFs). The basic features of a model include \begin{itemize} \item Each model is defined inside a \zspace{}. Its dimensions are ``observables", simple string identifiers as previously discussed. \item A model implements a function that returns its value depending on some data. This is either \pyth{pdf} for a PDF or \pyth{func} for a Function. \item Full as well as partial integration over a model is possible. This includes numerical as well as analytical integration, the latter if available. \item A sample following the model's shape can be generated using numerical or analytical methods, the latter only if available. \end{itemize} The value function as well as the integration and sampling are implemented to return pure Tensors. Depending on the task, higher-level methods are also implemented, providing either a more intuitive, imperative behaviour or a significantly more performant execution for certain cases such as repeated pseudo-experiments. The main differences between a PDF and a Function concern the normalisation and the output dimensionality, which leads to a few subtle distinctions. \begin{description} \item[PDF] A PDF $f(x)$ is only well-defined with a given normalisation range. This defines the normalisation constant so that the integral over the PDF equals one, \begin{equation} \label{eq:norm_range} \sum_{i}\int_{l_i}^{u_i} f(x) \,\mathrm{d}x = 1, \end{equation} with $i$ indexing all the limits that make up the domain and $l_i$ and $u_i$ denoting the lower and upper limits, respectively. A PDF object has three special attributes, which are \begin{itemize} \item the probability density function \pyth{pdf}. It always returns a rank-one Tensor, i.e. a simple vector, whose length equals the number of data points. \item the probability density function \textit{without} the normalisation constant, \pyth{unnormalised_pdf}. In cases where only the shape is needed, using this function is less expensive, especially if the normalisation has to be computed numerically. \item the limits that define the normalisation constant of Eq. \ref{eq:norm_range}. They can be set using the \pyth{set_norm_range} method. \end{itemize} \item[Function] A Function is in a way simpler and more general than a PDF. It takes the same input as a PDF but returns a value with dimensionality $\mathbb{R}^m$. It can be used to transform values or as a building block for more complex expressions. It has the method \pyth{func} to evaluate its value and no other special attributes. \end{description} \subsubsection{Parametrization} \label{sec:parametrization} Models can be parametrized by \pyth{Parameter} objects, which can be used in the implementation of the shape function like any other Tensor-like object when building the model. In the following, we will first have a look at the \pyth{Parameter} itself; afterwards, we will see how exactly parameters are used with a model. There is a distinction between dependent and independent parameters: only the latter can be changed directly and have limits, while the former can be any arbitrary combination of independent parameters.
An independent \pyth{Parameter} \begin{itemize} \item has a name with a purely descriptive purpose; \item has an initial value; \item may have lower and upper limits; \item is either floating or not, independent of whether limits were specified; \item has a step size indicating the order of magnitude of changes of the parameter. It can be given explicitly; otherwise, it is automatically inferred. A well-chosen step size improves the minimization process and can be critical, mostly in the absence of limits and when the model depends only weakly on the parameter, so that large steps are required to change the model. A sketch of creating parameters is shown at the end of this section. \end{itemize} Currently, the shape of parameters is implicitly restricted to a scalar.\footnote{This comes from the fact that the PDF will receive input data of varying size, which is not controlled by the user, such as when doing numerical integration. Any parameter therefore has to be able to broadcast seamlessly.} As a consequence, a parameter cannot simply be treated like data as is possible in \roofit. A function \pyth{Parameter} which depends on the data itself will most likely be available in the future. The name of a parameter, as any other name of a single object in \zfit{}, is purely descriptive. There is purposely\footnote{Contrary to the \root framework. If the need ever arises, adding this as an additional feature is relatively simple.} no direct way provided to access parameters by name: instead of using names, the actual parameter object is passed around. This has the advantage of avoiding double bookkeeping of the parameters, where a reference to the object as well as to a string would be needed. The name is mandatory for parameters, as opposed to other objects in \zfit{}, since matching the value of a parameter to the right name is a critical task during and after a minimization, when reading off the right value. Composed parameters are dependent parameters. In general, they can be any Tensor, i.e. the result of any kind of operation. They can depend on zero, one or more independent parameters, as composed parameters are arbitrary functions; therefore, operations such as shifting and scaling are included as a trivial subset. For more details on the actual implementation and the dependency management see Appendix \ref{appendix:dependency management}. Composed parameters can currently neither be floating nor have limits, since the independent parameters a composed parameter may depend on can have arbitrary relations and have to be restricted themselves. In order to restrict a parameter, arbitrary constraints can be given to the loss instead. Every model can depend on multiple parameters, both dependent and independent. Each parameter that parametrises the model has a name specific to the model and is given on instantiation. For example, a \texttt{Gauss} has parameters named \texttt{mu} and \texttt{sigma}, as in the example in Sec. \ref{sec:quickstart}. They are stored in a mapping attribute named \pyth{params}. \begin{center} \begin{minipage}{\textwidth} \begin{python} mu_param = gauss.params["mu"] sigma_param = gauss.params["sigma"] \end{python} \end{minipage} \end{center} Notice that this does not contradict the statement above that single objects \textit{cannot} be accessed by \textit{their} name. Each object has a unique identifier, the name, but objects can have names for their constituents that are \textit{not} unique, like \texttt{mu}. This simply describes a part of the PDF and any \texttt{Gauss} will have a parameter \texttt{mu}.
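To make the parameter interface concrete, the following is a minimal sketch of creating an independent and a composed parameter. The names and values are purely illustrative, and the \pyth{ComposedParameter} signature shown assumes a recent \zfit{} version. \begin{center} \begin{minipage}{\textwidth} \begin{python}
import zfit

# independent parameter: name, initial value, optional lower/upper
# limits and an explicit step size (otherwise inferred automatically)
mu = zfit.Parameter("mu", 1.2, -5., 5., step_size=0.01)
mu.floating = False  # temporarily fix the parameter
mu.floating = True
mu.set_value(1.5)  # directly assign a new value

# dependent (composed) parameter: an arbitrary function of other
# parameters; it can neither be floating nor have limits of its own
shifted_mu = zfit.ComposedParameter(
    "shifted_mu",
    lambda params: params["mu"] + 0.5,
    params={"mu": mu})
\end{python} \end{minipage} \end{center}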
\subsubsection{Implementing a custom PDF} An essential feature of \zfit{} is the ability to simply create custom models. There is a large freedom in building models from the \pyth{BaseModel} class, since it takes care of most boilerplate and has well-defined entry points that can be customized. The full implementation details and possibilities for customization are described in Appendix \ref{appendix:basemodel}. For the most common use cases though, there exists a simple way of creating a custom model. The \pyth{ZPDF}, basically a more user-friendly wrapper around \pyth{BasePDF}, can be used as a base class in these simple cases. The following function will be used as the PDF shape \begin{equation} \label{eq:simple_linear_model} f(x, y) = a \cdot x^2 + b \cdot y^4. \end{equation} To implement this function, \pyth{_unnormalized_pdf} has to be overwritten. For the vast majority of custom models, this is the only method that needs to be overwritten. Changing other methods, especially \pyth{_pdf}, is an advanced feature and only needed in special cases. For more details on the customization and the possibility of hooking into the calls see Appendix \ref{appendix:basemodel}. This is a two dimensional PDF with two parameters. To implement it in \zfit{}, a new class is created \begin{center} \begin{minipage}{\textwidth} \label{py:custom_pdf_linear_model} \begin{python}
class EvenPolyPDF(zfit.pdf.ZPDF):
    """Implementation of f(x, y) = a*x^2 + b*y^4"""
    _PARAMS = ['a', 'b']
    _N_OBS = 2  # since two dimensional

    def _unnormalized_pdf(self, x):
        xdata, ydata = x.unstack_x()
        a = self.params['a']
        b = self.params['b']
        return a * xdata ** 2 + b * ydata ** 4
\end{python} \end{minipage} \end{center} Note that we only need the \textit{shape} of the function and do not need to take care of the normalisation, as numerical methods are already implemented in the base class. We can see the advantage of the preprocessing that is done by the base class, especially the reordering of the data \pyth{x}. It is a \zdata{} object and calling the \pyth{unstack_x} method returns a list\footnote{Or a single Tensor for the one dimensional case, if not specified differently in the arguments.} containing a Tensor for each column, sorted according to the model's observables. The length of the list has to coincide with the specified \pyth{_N_OBS}, the number of observables. It is not mandatory to specify this field in a model, and sometimes it cannot be known in advance, in which case it can simply be omitted. The naming of the parametrization of the function is defined with \pyth{_PARAMS}. These exact names have to be used when creating an instance of the model. The parameters are then stored in the \pyth{params} dictionary and extracting them is usually the first step inside \pyth{_unnormalized_pdf}. Documentation plays an important role here: it defines the names of the parameters that have to be used and what they represent in the function, but it is also crucial to communicate the ordering of the data. In this case, a user can see that the first observable and the corresponding column in the data will be used as \textit{x} in the function. This PDF is already complete and works out of the box. We can create an instance and use its methods. As an example, the observables of the instance will be called ``xobs" and ``yobs" with the normalisation range going from $0$ to $10$ and from $0$ to $5$, respectively. \begin{center} \begin{minipage}{\textwidth} \label{py:instance_custom_pdf_linear_model} \begin{python}
param_a = zfit.Parameter("a", 1.)
param_b = zfit.Parameter("b", 2.)
x_obs = zfit.Space("xobs", limits=(0, 10))
y_obs = zfit.Space("yobs", limits=(0, 5))
obs = x_obs * y_obs

poly_model = EvenPolyPDF(a=param_a, b=param_b, obs=obs)

prob = poly_model.pdf(...)  # some data needed

x_limits = zfit.Space("xobs", limits=(3, 5))
y_limits = zfit.Space("yobs", limits=(1, 3))
integral_limits = x_limits * y_limits
integral = poly_model.integrate(integral_limits)

sample = poly_model.sample(n=1000)
\end{python} \end{minipage} \end{center} Using the \pyth{integrate} method, we obtain the integral $i$ of our normalised PDF $f_{normed}$ \begin{align} \label{eq:model integral} i &= \int_3^5 \mathrm{d}x \int_{1}^{3} \mathrm{d}y ~f_{normed} (x, y) \\ &\stackrel{(\ref{eq:pdf from func})}{=} \frac{\int_3^5 \mathrm{d}x \int_{1}^{3} \mathrm{d}y ~f(x, y)}{\int_0^{10} \mathrm{d}x \int_{0}^{5} \mathrm{d}y ~f(x, y)} \end{align} with $x$ and $y$ corresponding to the observables ``xobs" and ``yobs", respectively, and $f(x, y)$ the unnormalised PDF as defined in Eq. \ref{eq:simple_linear_model}. Using Eq. \ref{eq:pdf from func}, we end up with a precise formulation of what is \textit{actually} executed in \zfit{}. In other words, the integral was calculated over the limits $3$ to $5$ and $1$ to $3$, respectively, and normalised over the normalisation range \texttt{obs}. The latter is defined on initialisation and taken as the default normalisation range. As mentioned before, the limits could also be multiplied in a different order, resulting in \pyth{integral_limits} having the order (``yobs", ``xobs"). The model takes care internally (see also Appendix \ref{sec:public_methods}) that the right limits end up at the right place. The \pyth{sample} method is used to generate 1000 points from the model. With the instance created, probability densities for given points can also be calculated by calling the \pyth{pdf} method with some \zdata{} object called \pyth{data} as follows \begin{center} \begin{minipage}{\textwidth} \begin{python} probs = poly_model.pdf(data) \end{python} \end{minipage} \end{center} It is worth noting that none of the operations are executed yet; what the methods return are Tensors, as described in Sec. \ref{sec:implementation}. To run the actual computations, it is necessary to call \pyth{zfit.run(...)} on any Tensor. Running \begin{center} \begin{minipage}{\textwidth} \begin{python}
probs_np = zfit.run(probs)
integral_np = zfit.run(integral)
\end{python} \end{minipage} \end{center} returns an actual numpy array for \pyth{probs_np} and a Python float for \pyth{integral_np}. For further, advanced customization of the PDF, methods can be overridden as described in Appendix \ref{sec:public_methods}. However, in most use cases there exist better ways: for example, to add an analytic integral to the model, overwriting \pyth{_analytic_integrate} should be the last resort. Instead, integrals can be registered with a model, defined either over a specific range only or over any range, and covering either all dimensions or only a subset. A priority attribute allows specifying a preference for one method over others. A typical use case for this feature is a special integral that is known exactly, for example the value of the integral over the full space of a Gaussian shaped model.\footnote{As already hinted in Eq.
\ref{eq:model integral}, it is important to note that the integral \textit{to be implemented} is over the unnormalised PDF as implemented in the \pyth{_unnormalized_pdf} method.} An integral over both dimensions is registered here, but partial integrals could be added as well \begin{equation} \label{integral_xy} \int_{x_0}^{x_1} \int_{y_0}^{y_1} f(x, y) \,\mathrm{d}y \,\mathrm{d}x = \frac{a}{3} \left(x_1^3 - x_0^3\right) \left(y_1 - y_0\right) + \frac{b}{5} \left(y_1^5 - y_0^5\right) \left(x_1 - x_0\right). \end{equation} In case a full integral is requested but no analytic integral over all dimensions is available, the fallback of \pyth{integrate} looks for partial integrals. If available, it uses them and numerically integrates over the remaining dimensions instead of using the unnormalised PDF. The implementation of the integral in \zfit{} is done by registering it with the PDF \begin{center} \begin{minipage}{\textwidth} \begin{python}
def integral_full(x, limits, norm_range, params, model):
    # unpack the lower and upper limits of both axes
    (lower,), (upper,) = limits.limits
    x_lower, y_lower = lower
    x_upper, y_upper = upper
    a = params['a']
    b = params['b']

    x_lower, y_lower, x_upper, y_upper = (
        ztf.convert_to_tensor(val)
        for val in (x_lower, y_lower, x_upper, y_upper))

    # analytic double integral of a*x^2 + b*y^4 over both axes
    return (a / 3 * (x_upper ** 3 - x_lower ** 3) * (y_upper - y_lower)
            + b / 5 * (y_upper ** 5 - y_lower ** 5) * (x_upper - x_lower))

lower = ((zfit.Space.ANY_LOWER, zfit.Space.ANY_LOWER),)
upper = ((zfit.Space.ANY_UPPER, zfit.Space.ANY_UPPER),)
limits = zfit.Space.from_axes(axes=(0, 1), limits=(lower, upper))
EvenPolyPDF.register_analytic_integral(func=integral_full, limits=limits)
\end{python} \end{minipage} \end{center} Since the observables that will be assigned to each axis are not yet known at this point, the \zspace{} has to be defined with axes, not with observables. This follows the principle of using axes inside the model as explained in Sec. \ref{sec:spaces and dimensions}. Instead of \pyth{ANY_LOWER} and \pyth{ANY_UPPER}, the integral could also be defined over a specific domain by using plain Python floats. The PDF will now automatically use this integral if possible. Otherwise, for example when a partial integral is requested, it will silently fall back to numerical integration. This example demonstrates how to implement a shape that only depends on the data and parameters. Models that also depend on other models, such as the \pyth{SumPDF} from Sec. \ref{sec:quickstart}, need to be built from a \pyth{Functor}. The only change this implies is an additional argument to the base class, the models it depends on, in order to track the dependencies correctly. This is described in more detail in Appendix \ref{appendix:functor}. \subsubsection{Sampling} Sampling from a model can be done in two different ways to cover two distinct use cases. On the one hand, the method \pyth{sample} returns a pure Tensor, which behaves like any other sampling operation in TF, for example \pyth{tf.random.uniform}. This can be useful for specific and mostly advanced cases, but the overall behaviour can be unintuitive for inexperienced users, as discussed in Appendix \ref{appendix:tensorflow}. On the other hand, sampling from a model is typically used to perform toy studies, which consist of generating events according to the model and afterwards fitting the model, with randomly initialized parameters, to the sampled data. With \pyth{create_sampler}, a \zdata{}-like object is created which handles the sampling from the model and the correct storage of the generated data. As any other \zdata{}, it can be used to build a loss. The actual data is sampled by invoking \pyth{resample} and stays unchanged until it is invoked again.
This sampler keeps a dependency on the original model and uses it to sample, keeping the parameter values it was created with. If desired, any parameter that has changed in between will be set to its new value when \pyth{resample} is called, effectively changing the shape that is sampled from.\footnote{To be clear: the default behaviour is that the sampler samples from the exact distribution that it was created with \textit{in terms of parameter values}.} To illustrate the use of the toy sampler, let's look at a toy study example. Note that the actual (large) \textit{objects} needed, such as the model, minimizer, loss etc., are created \textit{outside} of the loop and only the necessary \textit{methods} are called \textit{inside}. In general, this removes boilerplate code from repeated object creation, but it is especially important when working with a graph-based backend. We assume that a model and a minimizer have already been built, as for example shown in Sec. \ref{sec:quickstart}, and we generate 1000 events according to our model. \begin{center} \begin{minipage}{\textwidth} \begin{python}
import numpy as np

sampler = model.create_sampler(n=1000)
nll = zfit.loss.UnbinnedNLL(model, sampler)

for _ in range(n_toys):
    sampler.resample()  # now the sampling happens
    for param in model.get_dependents(only_floating=True):
        param.set_value(np.random.normal())  # any initial value
    result = minimizer.minimize(nll)
    if result.converged:  # check if the fit was successful
        ...  # save results in a list or similar
\end{python} \end{minipage} \end{center} The sampling is implemented with the accept-reject method, which works with arbitrary shapes. If an analytic inverse integral is registered, it is used instead, providing a more efficient way of sampling. More details about the different techniques and the use of importance sampling are described in Appendix \ref{appendix:sampling techniques}. \subsubsection{Extended PDFs} In addition to the shape, a PDF can carry more information, namely a yield, in which case we refer to it as an Extended PDF. The yield is a scale that describes the magnitude of the PDF, typically reflecting the number of events in the data sample. The output of \pyth{pdf}, the normal probability density function, can be multiplied by it, which results in a number density. This implies that multiplying an integral by the yield returns the expected number of events in the integrated range instead of the probability.\footnote{\textit{Currently}, the integral of an extended PDF is multiplied by the yield \textit{by default}, but this behaviour is meant to change in a future version.} To count, for example, how many signal particles are in a sample, a composite PDF consisting of a background and a signal component, both extended, can be fitted to the data. The \pyth{SumPDF} used to build this composition does not need a fraction if the PDFs are already extended, but uses the yields, normalised to one, as fractions. In this case, integrating the PDF that represents the signal shape over the whole range returns the number of signal events. To create an extended PDF, \pyth{create_extended} can be used. This method returns a copy of the PDF with a yield added to it.
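As an illustration, the following is a minimal sketch of building an extended signal-plus-background model. The observable, the parameter values and the choice of two Gaussian shapes are purely illustrative. \begin{center} \begin{minipage}{\textwidth} \begin{python}
import zfit

obs = zfit.Space("x", limits=(-5, 5))

# shapes: a narrow "signal" and a wide "background" Gaussian
mu = zfit.Parameter("mu", 0.)
sigma_sig = zfit.Parameter("sigma_sig", 0.5)
sigma_bkg = zfit.Parameter("sigma_bkg", 3.)
signal_shape = zfit.pdf.Gauss(mu=mu, sigma=sigma_sig, obs=obs)
bkg_shape = zfit.pdf.Gauss(mu=mu, sigma=sigma_bkg, obs=obs)

# yields are parameters; create_extended returns an extended copy
sig_yield = zfit.Parameter("sig_yield", 1000., 0., 1e6)
bkg_yield = zfit.Parameter("bkg_yield", 5000., 0., 1e6)
signal = signal_shape.create_extended(sig_yield)
background = bkg_shape.create_extended(bkg_yield)

# no fractions needed: the normalised yields act as fractions
model = zfit.pdf.SumPDF([signal, background])

# integrating the extended signal PDF over the full range returns
# a Tensor with the expected number of signal events
n_sig = signal.integrate(limits=obs)
\end{python} \end{minipage} \end{center} After a fit to data, evaluating \pyth{n_sig} with \pyth{zfit.run} would return the fitted number of signal events in the integrated range.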