\subsection{Requirements}
\label{sec:requirements}

A few features are crucial for a model fitting library. An important part is the model building itself, but a library should also offer a convenient, transparent way to create the loss and to perform the minimisation. Especially in HEP, the following features are essential:

\begin{itemize}
\item PDFs are by definition normalised over a certain range. In most other libraries and fields, the domain is assumed to be $(-\infty, \infty)$. In HEP this is practically never the case and a finite normalisation range is used.
\item Fits in HEP are often more than one-dimensional. The framework should therefore extend naturally to higher dimensions.
\item Building and combining models from basic shapes like Gaussian or exponential functions suffices only for simpler cases and is often not enough for more complicated or analysis-specific models. Therefore, a convenient way to implement custom models has to be provided.
\item Reasonable scaling with the data size and the model complexity is a key criterion. This is especially hard to achieve in combination with the ability to specify custom models, since the latter usually requires the parallelisation to be implemented by the user.
\item While the minimisation of the loss yields an optimal value for each parameter, in HEP it is crucial to also know the uncertainty of that value. This requires the library to handle parameters and their uncertainties transparently and to provide the flexibility for advanced statistical treatments.
\end{itemize}
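To make the normalisation requirement concrete: for an unnormalised shape $h(x \mid \theta)$ with parameters $\theta$, a finite, possibly multi-dimensional normalisation range $\mathcal{D}$ and a data set $\{x_i\}_{i=1}^{N}$, the PDF and the corresponding unbinned negative log-likelihood take the standard form
\begin{equation}
f(x \mid \theta) = \frac{h(x \mid \theta)}{\int_{\mathcal{D}} h(x' \mid \theta) \, \mathrm{d}x'} \,, \qquad
-\ln \mathcal{L}(\theta) = -\sum_{i=1}^{N} \ln f(x_i \mid \theta) \,,
\end{equation}
such that $\int_{\mathcal{D}} f(x \mid \theta) \, \mathrm{d}x = 1$ holds by construction for any value of $\theta$.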
\subsection{Existing libraries}
\label{sec:landscape}

Model fitting itself is nothing new; in fact, a lot of model fitting libraries are already available, some of them written in Python and covering a scope similar to that of \zfit{}. Building a new fitting library from scratch therefore sounds like reinventing the wheel and should be avoided if not necessary. But as discussed in Sec.~\ref{sec:Introduction}, fundamental changes in the computing architecture are leading to vectorised paradigms. Additionally, the need in HEP for larger and more complicated, yet still flexible, fits requires keeping up with the state of the art in computing, and this sometimes requires a reinvention. However, it is imperative to make sure that no existing library already fulfils these needs or can be extended to do so. And even when concluding that a new library is the way to go, as much as possible should be learned and taken from existing libraries in order to reinvent as little as necessary. In the following, an overview of existing libraries is given.

\subsubsection{General fitting}

Fitting models to data is a task performed in a variety of fields beyond HEP. Several general fitting libraries exist in Python, but they often contain functionality not actually needed in HEP, such as means, variances or survival functions, while lacking central features like a custom normalisation range or the extension to more than one dimension.

\begin{itemize}
\item Scipy\cite{software:scipy} is the go-to library for scientific calculations in Python and provides an extensive toolbox of statistical and numerical methods. It contains a module with distributions that have proven to be stable and to work well; a short usage sketch is given after this list. Downsides of the package include an implementation that is not optimised for parallelisation and the lack of support for composite models.
\item lmfit\cite{software:lmfit} shares a lot of its design, in terms of naming and concepts, with \zfit{}. It is built for model fitting and has parameters, minimisers, fit results and more. It lacks more advanced features such as normalisation ranges for PDFs and good scalability: it is built on top of numpy, a fast numerical library for Python, and scipy, which strongly limits the potential for massive parallelisation.
\item TensorFlow Probability\cite{DBLP:journals/corr/abs-1711-10604} provides a library for statistical reasoning. Its focus is on analytical functions and it only marginally extends to numerical and Monte Carlo methods, which limits its application to analytically integrable functions. Interestingly, it contains a lot of features that can be used inside or together with \zfit{}, such as Bayesian inference with MCMC samplers and analytic functions with integrals already implemented in TF.
\end{itemize}
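To illustrate the Scipy distributions module mentioned above, the following minimal sketch (with purely hypothetical toy data and parameter values) performs an unbinned maximum-likelihood fit of a Gaussian. The fit itself is a one-liner, but the PDF is implicitly normalised over $(-\infty, \infty)$: a finite normalisation range requires a dedicated distribution such as \texttt{truncnorm}, whose bounds are given in units of the width, and no generic mechanism exists for arbitrary custom shapes.

\begin{verbatim}
import numpy as np
from scipy import stats

# Hypothetical toy sample, values chosen purely for illustration.
rng = np.random.default_rng(0)
data = rng.normal(loc=0.2, scale=1.1, size=10000)

# Unbinned maximum-likelihood fit of a Gaussian; the PDF is
# normalised over (-inf, inf), not over a finite range.
mu, sigma = stats.norm.fit(data)

# A finite range needs a special-purpose distribution: truncnorm
# expects its bounds in units of sigma, relative to mu.
a, b = (-3.0 - mu) / sigma, (3.0 - mu) / sigma
pdf_at_zero = stats.truncnorm.pdf(0.0, a, b, loc=mu, scale=sigma)
\end{verbatim}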
\subsubsection{HEP specific}

A wide range of specialised fitters exists in HEP. The overview here is limited to general-purpose fitters that can be used from Python.

\begin{description}
\item \roofit\cite{Verkerke:2003ir} is the de-facto standard tool for fitting in HEP. Models are built using classes and provide automatic normalisation and integration; a minimal usage sketch is given at the end of this section. Binned as well as unbinned fits are supported. \roofit itself extends beyond that and also offers an extensive plotting and statistics module. While the library has proven itself in numerous analyses over the years, and the model building part of \zfit{} is in fact inspired by the core of \roofit, there are several shortcomings that \zfit{} is meant to address:
\begin{itemize}
\item \roofit is not a native Python library but can only be accessed through the Python bindings to \root. Since \roofit manages its own memory in C++ while Python uses garbage collection as well, this can lead to memory leaks and completely undefined behaviour.
\item Since the Python interface is a thin wrapper around the C++ classes, it does not integrate well with the scientific Python stack.
\item In terms of flexibility, \roofit can be extended to a certain degree with custom classes in pure C++. Especially when used from Python, however, it does not provide a convenient way to define custom PDFs.
\item While there are improvements in the pipeline, it is not natively optimised to run vectorised on multiple cores or on accelerators such as GPUs.
\item Since its usage requires \root, the installation and setup are typically not lightweight.
\end{itemize}
\item probfit\cite{software:probfit} is a fitting library written in Python that mainly uses Cython to perform the heavy computations. This limits both the performance and the implementation of custom PDFs, which makes a possible extension hard. Since it provides only limited features, including new ones would require a large extension together with a major conceptual overhaul.
\item pyhf\cite{software:pyhf} is a re-implementation of HistFactory from \root in Python. It uses TensorFlow as well as other libraries, including PyTorch and Numpy, as backends. It is designed purely for binned template fits and does not extend its functionality beyond that point.
\item The CMS Combine Tool\cite{higgsanalysis_combinedlimit} contains a subpart that implements template fits in TF. Several useful pieces, such as likelihood profiling and a minimiser in pure TF, have been implemented there. However, it does not extend its functionality further and is currently not available as a stand-alone package.
\item TensorFlow Analysis\cite{tensorflow_analysis} is a library with a simple, functional approach: the loss is built with TF and Minuit\cite{James:1975dr} is used directly inside\footnote{This also requires the \root package to be installed.} to find the minimum. It offers a lot of physics content for creating models. While the lightweight approach comes with a lot of flexibility, the library also leaves quite some work to the user; for example, it offers nothing close to model composition with automatic normalisation. Notably, in its current state the library lacks Python 3 support. Its importance has to be stressed nonetheless: it demonstrated the feasibility of using TF for unbinned likelihood fits with complex models and was a major inspiration for the development of \zfit{}.
\item[TensorProb] is a model fitting library in Python that uses TF as the backend. It was built with a goal similar to that of \zfit{}, namely to provide a model fitting library in Python using TF, but with a more experimental approach. It offers models that provide integration and sampling. It is based on older TF versions and is strongly limited in functionality. Most importantly, the project never grew out of its experimental status and has been discontinued; it now recommends using \zfit{} instead.
\end{description}

While the discussed model fitting libraries have different strengths and weaknesses, no single one fully fulfils the needs of HEP. It is worth pointing out, however, that the concepts and designs they demonstrate, and even certain functionality that can be used directly with \zfit{}, are essential pieces in the development of \zfit{}.
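For comparison with the Scipy sketch above, a minimal \roofit model built through the Python bindings might look as follows. This is only an illustrative sketch with hypothetical parameter ranges and values, and it assumes a working \root installation.

\begin{verbatim}
import ROOT

# Observable with a finite range; RooFit normalises the PDF over it.
x = ROOT.RooRealVar("x", "x", -5.0, 5.0)
mean = ROOT.RooRealVar("mean", "mean", 0.0, -1.0, 1.0)
sigma = ROOT.RooRealVar("sigma", "sigma", 1.0, 0.1, 5.0)
gauss = ROOT.RooGaussian("gauss", "gauss", x, mean, sigma)

# Toy sample and unbinned maximum-likelihood fit.
data = gauss.generate(ROOT.RooArgSet(x), 10000)
result = gauss.fitTo(data, ROOT.RooFit.Save())
result.Print()
\end{verbatim}

Model building, normalisation over the range of \texttt{x} and the unbinned fit are handled by the framework; the shortcomings listed above concern memory management, extensibility from Python and integration with the scientific Python stack rather than this core workflow.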