\section{Introduction}
\label{sec:Introduction}

The Standard Model (SM) of Particle Physics describes the most fundamental 
particles in the universe and their interactions. According to it, all matter 
is made up of fermions: quarks and leptons. They appear in different flavours 
and generations as depicted in Fig. \ref{fig:sm}. 
There are additionally four gauge bosons that allow 
the particles to interact via their exchange: the photon is the 
electromagnetic force carrier and couples to particles with an electric charge, 
such as the electron and the quarks. The W and Z bosons are the carriers of the 
weak force and couple to all fermions. The gluon mediates the strong force, 
which binds the quarks together. Unlike the other forces, the strength of the 
strong interaction 
increases with the distance between two particles. As a consequence, quarks do 
not appear alone in nature. Instead, 
they form composite particles consisting of multiple quarks, the hadrons. While 
there are hundreds of hadrons, nearly all of them have 
lifetimes below a nanosecond and decay to lighter particles. The only 
stable particles are the well-known proton, electron and neutrino, together 
with the neutron when bound inside nuclei; these make up the visible matter in 
our universe. Finally, all particles in the SM (except neutrinos) acquire mass 
through their interaction with the Higgs field, whose quantum excitation is the 
Higgs boson.

\begin{figure}[bp]
	\centering
	\includegraphics[width=0.5\textwidth]{figs/sm_overview.png}
	\caption{The particles of the SM.}
	\label{fig:sm}
\end{figure}

With the recent discovery of the Higgs boson, the last missing piece of the SM 
has been found, and the model provides a complete description of nearly all 
observations. Yet it does not seem to be the final answer, since some phenomena 
remain unexplained: dark matter, which interacts gravitationally and largely 
determines the dynamics of galaxies, has no counterpart in the SM, and the 
neutrino masses implied by neutrino oscillations contradict the SM prediction 
of massless neutrinos. With larger amounts of data collected, more precise 
measurements need to be made in order to look for further inconsistencies of 
the SM that can guide us to a new theory.

In the scientific context, what is called an observation is in fact an answer 
extracted from nature by asking the right question and analysing the response 
statistically. A question in this sense is an experimental setup, and a 
scientific hypothesis is a proposed explanation for an observed phenomenon 
that can be tested. Different methods can be used to test a hypothesis, but 
all of them make use of a test statistic: a single value that quantifies the 
agreement between the hypothesis and the observations. Given strong enough 
evidence, the null hypothesis may be rejected in favour of an alternative 
hypothesis. As an example, this procedure is used for the discovery of a new 
particle, where the background-only hypothesis acts as the null and the 
combination of background and signal as the alternative.
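
As an illustration of this procedure (a sketch with invented numbers, not a 
result from this thesis), the following Python snippet evaluates the standard 
likelihood-ratio test statistic for a simple Poisson counting experiment, 
taking the background-only model as the null hypothesis.

\begin{verbatim}
# Hypothetical counting experiment: all yields below are invented
# for illustration only.
import numpy as np
from scipy import stats

n_obs = 125   # observed number of events
b = 100.0     # expected background yield (null hypothesis)
s = 20.0      # expected signal yield for unit signal strength

def nll(n, mu):
    """Negative log-likelihood of observing n events given mean mu."""
    return -stats.poisson.logpmf(n, mu)

# Best-fit signal strength and discovery test statistic
# q0 = -2 ln [ L(background only) / L(best fit) ].
mu_hat = max(0.0, (n_obs - b) / s)
q0 = 2.0 * (nll(n_obs, b) - nll(n_obs, b + mu_hat * s))

# Asymptotically, the significance of rejecting the null is sqrt(q0).
print(f"q0 = {q0:.2f}, significance = {np.sqrt(q0):.2f} sigma")
\end{verbatim}

The same logic, applied to far more complex likelihood models, underlies the 
statistical analyses performed in HEP.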


As already mentioned, observations from experiments are needed. To study the 
fundamental particles of the SM, high enough energies are required to produce 
them. High Energy Physics (HEP) experiments all over the world accelerate 
light particles such as electrons or protons and let them collide. 
The Large Hadron Collider (\lhc) at \cern accelerates protons to energies of up 
to 6.5\tev\footnote{Natural units
	with $\hbar=c=1$ are used throughout.}, currently the high-energy 
	frontier.
Around the collider, 
there are four large experiments: the general-purpose detectors \atlas and 
\cms, \alice, an experiment specialized in heavy-ion (lead) collisions, and 
\lhcb, a detector focused on 
the study of heavy-flavour decays. These experiments are situated at collision 
points around the 
\lhc, where 40 million collisions occur per second. 
Due to the high concentration 
of energy in a collision, heavier particles are created that decay immediately 
to lighter ones. The experiments measure the tracks and properties of the 
decay products that pass through the different detector components. Their raw 
readout is forwarded to a computer 
farm, where events of 
interest are marked and kept while the remaining events are discarded. 
This 
reduces the data stream to a rate that allows it to be 
written to persistent storage. From there the data can be retrieved and used 
for further, offline 
analysis. 

Performing a full analysis to measure physics observables from the data 
involves 
several steps. These include, among others, cleaning the samples by applying 
selection criteria, reweighting them to correct for systematic effects, and 
creating new features that describe the event better. The sample can then be 
used to directly infer 
unknown parameters by using physically motivated models and performing a fit to 
the data. All of these analysis steps require convenient, reliable and fast 
libraries together with sufficient computing resources. To accomplish this, a 
large amount of code is written and layered on top of existing software. To 
cope with the ever increasing amounts of data, both in real-time event 
filtering and in offline analysis, it is 
mandatory to keep the computing infrastructure at the state of the art.
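
To make these preprocessing steps concrete, the following sketch shows what 
they typically look like in Python; the column names and cut values are 
hypothetical and serve only as an illustration.

\begin{verbatim}
# Illustration only: hypothetical column names and cut values.
import numpy as np
import pandas as pd

events = pd.read_csv("events.csv")  # some event sample

# Cleaning: apply selection criteria to suppress background.
events = events[(events["muon_pt"] > 1000.0)
                & (events["vertex_chi2"] < 9.0)]

# Reweighting: correct for a systematic effect, here a per-event
# efficiency difference between data and simulation.
events["weight"] = 1.0 / events["trigger_efficiency"]

# Feature engineering: build a quantity that describes the event
# better than the raw observables, e.g. an invariant mass.
events["m_inv"] = np.sqrt(events["energy_sum"] ** 2
                          - events["momentum_sum"] ** 2)

# The cleaned, weighted sample is then passed to a model fit
# to infer the physics parameters of interest.
\end{verbatim}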


Computing is still a comparably young, fast-moving field. Hundreds of general 
programming languages exist; most of them do not last long or remain confined 
to a specialist community. Even long-lived languages have both advantages and 
shortcomings. This leads different fields to adopt a few, or even a single, 
main language that serves their purpose particularly well. The field of 
scientific computing mainly involves either the simulation of systems or data 
analysis. In both cases, languages that are fast at numerical computation are 
required. 
Among the most popular languages for heavy computations are Fortran and C/C++. 
The former is over six decades old and still in use today. It was 
designed for numerical processing and contains optimizations that still 
outperform other languages. C and its descendant C++, the most 
popular language in HEP, date back about four decades 
and are designed for more general usage than Fortran. Although 
C++ is fast and a powerful general-purpose programming 
language, it does not offer the convenient abstractions of 
scripting languages. It allows, but also requires, manual handling of 
certain resources, such as 
memory allocation, and it is limited in flexibility because it is a 
statically compiled language. While static compilation allows for highly 
performant execution, interpreted scripting languages such as Python offer 
additional comfort and flexibility. Although the execution of actual Python 
code can be 
significantly slower than that of comparable statically compiled languages when 
it comes to pure number crunching, the huge Python package ecosystem offers 
many libraries that implement time-consuming mathematical operations in a more 
efficient language, such as Fortran or C++. This turns Python into a high-level 
language that abstracts away the handling of computationally demanding 
operations through external function calls. Together with its 
especially clean syntax, Python code is expressive and natural to read, 
giving up only a small amount of performance to the overhead of the 
external calls.
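
A small, self-contained sketch illustrates this division of labour: the same 
computation written once as an interpreted Python loop and once as a single 
call into NumPy's compiled routines (the exact timings are machine dependent).

\begin{verbatim}
# Compare a pure Python loop with the same operation delegated to
# NumPy's compiled kernels.
import timeit
import numpy as np

values = np.random.normal(size=1_000_000)

def sum_of_squares_python(xs):
    total = 0.0
    for x in xs:  # interpreted loop, element by element
        total += x * x
    return total

def sum_of_squares_numpy(xs):
    return float(np.sum(xs * xs))  # one call into optimized C code

print(timeit.timeit(lambda: sum_of_squares_python(values), number=10))
print(timeit.timeit(lambda: sum_of_squares_numpy(values), number=10))
\end{verbatim}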

This combination, together with a fast-growing open-source community, has 
established Python as the most popular language for data analysis. With the 
recent advances 
in Machine Learning and the rising popularity of Big Data analysis among 
industry leaders, the size and quality of the scientific Python ecosystem have 
made a huge leap forward. Topics like deep learning, which require highly 
optimised code due to the abundance of vectorised matrix multiplications, have 
led to 
the appearance of frameworks designed for this kind of massively parallel 
computation and supported by large companies, such as Google's 
TensorFlow\cite{tensorflow2015-whitepaper} (TF) 
or Facebook's PyTorch\cite{paszke2017automatic}. With large economic 
interests coming into 
play, these frameworks also focus on the efficient use of specialized 
hardware, such as 
Graphics Processing Units (GPUs), which are by design optimized for vectorized 
computations and therefore fit the needs of the deep learning community. These 
frameworks are optimized both in terms of performance and of ease of 
use, taking over the burden of parallelization from the user.
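
The programming model such frameworks expose can be sketched as follows, here 
with TensorFlow (an illustrative example, not code from this thesis): the 
computation is expressed as vectorised tensor operations, gradients are 
obtained automatically, and the same code runs on a GPU if one is available.

\begin{verbatim}
# Illustrative sketch with TensorFlow 2: vectorised operations and
# automatic differentiation of a Gaussian negative log-likelihood.
import tensorflow as tf

x = tf.random.normal([1_000_000])  # one million sample points
mu = tf.Variable(0.5)
sigma = tf.Variable(1.2)

with tf.GradientTape() as tape:
    # Element-wise array operations; the framework parallelises them
    # internally and can place them on a GPU transparently.
    nll = tf.reduce_sum(
        0.5 * ((x - mu) / sigma) ** 2 + tf.math.log(sigma))

grads = tape.gradient(nll, [mu, sigma])  # gradients come for free
print(float(nll), [float(g) for g in grads])
\end{verbatim}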

Alongside these developments, there is also a trend within the HEP community 
to move towards a more Python-oriented software stack. In recent 
surveys\cite{hep_survey_jim}, Python usage surpasses that of C++ in
collaborations such as \cms. The existence of the scientific Python ecosystem 
offers the possibility of sharing some of the 
effort with the data analysis industry and the open-source community, allowing 
a significant number of the analysis steps in HEP to be performed within Python 
out of 
the box. This 
leaves the HEP community with only the burden of developing the field-specific 
tools 
required to fit into the ecosystem. While some of the existing C++ frameworks 
offer Python 
bindings, they are usually not well integrated with the Python language and the 
wider ecosystem. Several model fitting libraries in pure Python have already 
been developed to fill parts of this gap, but none of them offers 
the complete feature set desired for HEP analysis, and they are hard to 
extend. Therefore, while large advances have been made on this 
front, a viable alternative to the existing, mature model fitting libraries in 
C++ is still missing. Nonetheless, some of these libraries have proved the 
feasibility of using deep learning frameworks as computing backends for model 
fitting.

\vspace{5mm}

Summarizing, HEP has

\begin{itemize}
	\item the need for scalable, flexible model fitting;
	\item a strong movement towards Python with its huge data analysis
	ecosystem;
	\item the lack of a sufficiently strong model fitting library in pure 
	Python.
\end{itemize}

Furthermore, modern high-performance computing frameworks have emerged from 
deep learning, and their feasibility as computational backends for model 
fitting has been demonstrated in several projects.

With these ideas in mind, the \zfit{} package has been developed with the goal 
of filling this need by creating a 
pure Python library built on a deep learning framework. This requires the 
formalisation of the fitting procedure, the establishment of a stable API and 
the usage of current knowledge from similar libraries. The following Section 
\ref{sec:modelfitting} will expand on model fitting in HEP, including a 
discussion of already existing libraries. With this knowledge, the 
usage of \zfit{} and its basic concepts, the formalisation of the model 
fitting workflow, and the choice of the backend and its capabilities are 
outlined in Sec. \ref{sec:quickstart}. 
The individual components of \zfit{} will be discussed in more detail in 
Sec. \ref{sec:parts}. Afterwards, the performance and scalability are evaluated 
with examples in Sec. \ref{sec:performance}. The extension of \zfit{} beyond 
default model fitting is discussed 
in Sec. \ref{sec:beyond standard fitting} by using its capabilities to 
implement an 
amplitude 
fit. Lastly, a brief overview of the future plans for the \zfit{} library and 
its ecosystem is given in Sec. \ref{sec:conclusion}.