\section{Introduction} \label{sec:Introduction}

The Standard Model (SM) of Particle Physics describes the most fundamental particles in the universe and their interactions. According to it, all matter is made up of fermions: quarks and leptons. They appear in different flavours and generations, as depicted in Fig. \ref{fig:sm}. In addition, there are four types of gauge bosons that allow the particles to interact via their exchange: the photon is the electromagnetic force carrier and couples to particles with an electric charge, such as the electron and the quarks. The W and Z bosons are the carriers of the weak force and couple to all fermions. Gluons mediate the strong force and couple to colour-charged particles, that is, to the quarks and to the gluons themselves. Unlike for the other forces, the strength of the strong interaction increases with the distance between two particles. As a consequence, quarks do not appear alone in nature. Instead, they form composite particles consisting of multiple quarks, the hadrons. While there are hundreds of hadrons, nearly all of them have lifetimes below a nanosecond and decay to lighter particles. The only stable particles are the well-known proton, electron and neutrino, together with the neutron when bound inside nuclei; they make up the visible matter in our universe. Finally, all particles in the SM (except neutrinos) acquire mass through their interaction with the Higgs field, whose excitation is the Higgs boson.

\begin{figure}[bp] \centering \includegraphics[width=0.5\textwidth]{figs/sm_overview.png} \caption{The particles of the SM.} \label{fig:sm} \end{figure}

With the recent discovery of the Higgs boson, the last missing piece of the SM has been found. It provides a complete description of nearly all observations. And yet it does not seem to be the final answer, since some phenomena remain unexplained: dark matter, which interacts gravitationally and significantly determines the dynamics of galaxies, has no counterpart in the SM, and the fact that neutrinos have mass, as implied by their oscillations, does not coincide with the predictions of the SM. With larger amounts of data collected, more precise measurements need to be made in order to look for further inconsistencies of the SM that can guide us to a new theory.

In the scientific context, what is called an observation is in fact an answer extracted from nature by asking the right question and using statistics to analyse the data. A question in this sense is an experimental setup, and a scientific hypothesis is a proposed explanation for an observed phenomenon which can be tested. Different methods can be used to verify a hypothesis, but all of them make use of a test statistic, a single value that quantifies the agreement of a hypothesis with the observations. Given strong enough evidence, the null hypothesis may be rejected in favour of an alternative hypothesis. As an example, this can be used for the discovery of a new particle, where the background-only hypothesis acts as the null and the alternative is the combination of background and signal (a simple form of such a test statistic is sketched below).

As already mentioned, observations from experiments are needed. To study the fundamental particles of the SM, sufficiently high energies are required to produce them. High Energy Physics (HEP) experiments all over the world accelerate light particles such as electrons or protons and let them collide. The Large Hadron Collider (\lhc) at \cern accelerates protons to energies of up to 6.5\tev\footnote{Natural units with $\hbar=c=1$ are used throughout.}, which is currently the high-energy frontier.
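Coming back to the hypothesis test mentioned above, a minimal, generic form of such a test statistic (an illustration only, not tied to any particular analysis in this work) is the likelihood ratio of the two hypotheses,
\[
q = -2 \ln \frac{\mathcal{L}_{\mathrm{bkg}}(\mathrm{data})}{\mathcal{L}_{\mathrm{sig+bkg}}(\mathrm{data})} ,
\]
where $\mathcal{L}_{\mathrm{bkg}}$ and $\mathcal{L}_{\mathrm{sig+bkg}}$ denote the likelihoods of the observed data under the background-only and the signal-plus-background hypotheses, respectively; large values of $q$ disfavour the background-only null hypothesis.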
Around the collider, there are four large experiments: the general purpose detectors \atlas and \cms, \alice, an experiment specialized in heavy-ion (lead) collisions, and \lhcb, a detector focused on the study of heavy-flavour decays. These experiments are situated at collision points around the \lhc, where proton bunches cross 40 million times per second. Due to the high concentration of energy in a collision, heavier particles are created which decay immediately to lighter ones. The experiments measure the tracks and properties of the decay products that pass through the different subdetectors. The raw readout is forwarded to a computer farm, where events of interest are selected and kept, while the remaining events are discarded. This reduces the stream of data to a rate that allows it to be written to persistent storage, from where it can be retrieved and used for further, offline analysis.

Performing a full analysis to measure physics observables from the data involves several steps. This includes, among others, cleaning the samples by applying selection criteria, reweighting them to correct for systematic effects, or creating new features that better describe the event. The sample can then be used to directly infer unknown parameters by using physically motivated models and performing a fit to the data. All of these analysis steps require convenient, reliable and fast libraries together with sufficient computing resources. To accomplish this, large amounts of code are written and layered on top of one another. To cope with the ever increasing amounts of data, both in real-time event filtering as well as in offline analysis, it is mandatory to keep the computing infrastructure at the state of the art.

Computing is still a comparably young, fast-moving field. Hundreds of general programming languages exist; most of them do not last long or remain confined to a specialist community. Even longer-lived languages have both advantages and shortcomings. This leads to different fields adopting a few, or even a single, main language that serves their purpose particularly well. The field of scientific computing mainly involves either the simulation of systems or data analysis. In both cases, languages that are fast at number crunching are required. Among the most popular languages for heavy computations are Fortran and C/C++. The former is over six decades old and still in use today. It was designed for numerical processing and contains optimizations that still outperform other languages. C and its extension C++, the most popular language in HEP, date back about four decades and are built for more general usage than Fortran. Although fast and powerful as a general-purpose programming language, C++ does not offer the convenient abstractions that scripting languages provide. It allows, but also requires, manual handling of certain resources, such as memory allocation, and is limited in terms of flexibility because it is a statically compiled language. While the latter feature allows for highly performant execution, interpreted scripting languages such as Python offer additional comfort and flexibility. While the execution of actual Python code can be significantly slower than comparable statically compiled languages when it comes to pure number crunching, the huge Python package ecosystem offers many libraries that implement time-consuming mathematical operations in a more efficient language, such as Fortran or C++.
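As a minimal sketch of this delegation (the calls are standard NumPy, but the data and numbers are purely illustrative), summing the squares of a large array can be written either as an interpreted Python loop or as a single vectorised call that runs in compiled code:
\begin{verbatim}
import numpy as np

# One million pseudo-measurements; the generation itself already runs in compiled code.
rng = np.random.default_rng(seed=0)
values = rng.exponential(scale=5.0, size=1_000_000)

# Pure-Python loop: every iteration passes through the interpreter.
total_loop = 0.0
for v in values:
    total_loop += v ** 2

# Vectorised NumPy call: the same computation is delegated to compiled C code
# and is typically orders of magnitude faster for arrays of this size.
total_vec = np.sum(values ** 2)

assert np.isclose(total_loop, total_vec)
\end{verbatim}
The same pattern, expressing computations as array operations that are executed by a compiled backend, also underlies the deep learning frameworks discussed below.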
Python thus acts as a high-level layer that delegates computation-demanding operations to external calls into compiled code. Together with an especially clean syntax, Python code is expressive and natural to read, incurring only a small performance penalty from the overhead of the external calls. This combination, together with a fast-growing open-source community, has established Python as the most popular language for data analysis.

With the recent advances in Machine Learning and the rising popularity of Big Data analysis among industry leaders, the size and quality of the scientific Python ecosystem have made a huge leap forward. Topics like deep learning, which require highly optimised code due to the abundance of vectorised matrix multiplications, have led to the appearance of frameworks designed for this kind of massively parallel computation and supported by large companies, such as Google's TensorFlow\cite{tensorflow2015-whitepaper} (TF) or Facebook's PyTorch\cite{paszke2017automatic}. With large economic interests coming into play, these frameworks also focus on the efficient use of specialized hardware, such as Graphics Processing Units (GPUs), which are by design optimized for vectorized computations and therefore fit the needs of the deep learning community. These frameworks are optimized both in terms of performance and ease of use, taking over the burden of parallelizing the computation.

Next to all these developments, there is also a trend inside the HEP community to move towards a more Python-oriented software stack. In recent surveys\cite{hep_survey_jim}, its usage surpasses C++ in collaborations such as \cms. The existence of the scientific Python ecosystem offers the possibility of sharing some of the effort with the data-analysis industry and the open-source community, allowing a significant number of the analysis steps in HEP to be performed within Python out of the box. This leaves the HEP community only with the burden of developing the field-specific tools required to fit into the ecosystem. While some of the existing frameworks in C++ offer Python bindings, they are usually not well integrated with the Python language and the wider ecosystem. Several model fitting libraries in pure Python have already been developed in order to fill parts of this gap, though none of them offers the complete feature set desired for HEP analysis, and they are hard to extend. Therefore, while large advances have been made on this front, a viable alternative to the existing, mature model fitting libraries in C++ is still missing. Nonetheless, some of these libraries have proved the feasibility of using deep learning frameworks as computing backends for model fitting (a minimal illustration is sketched below).

\vspace{5mm} Summarizing, HEP has
\begin{itemize}
\item the need for scalable, flexible model fitting;
\item a strong movement towards Python with its huge data analysis ecosystem;
\item the lack of a sufficiently strong model fitting library in pure Python.
\end{itemize}
Furthermore, modern high performance computing frameworks from deep learning have arisen, and their feasibility as computational backends for model fitting was demonstrated in several projects. With these ideas in mind, the \zfit{} package has been developed with the goal of filling this need by creating a pure Python based library built on a deep learning framework. This requires the formalisation of the fitting procedure, the establishment of a stable API and the use of current knowledge from similar libraries.
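To illustrate the feasibility of deep learning frameworks as computing backends mentioned above, the following minimal sketch (illustrative only; it does not show the \zfit{} API, and all names and values are made up) expresses the negative log-likelihood of a Gaussian model with TensorFlow and obtains its gradients by automatic differentiation, the kind of building block a model fitting library can hand over to a numerical minimiser:
\begin{verbatim}
import numpy as np
import tensorflow as tf

# Illustrative pseudo-data drawn from a normal distribution.
rng = np.random.default_rng(seed=1)
data = tf.constant(rng.normal(loc=0.2, scale=1.1, size=10_000))

# Free parameters of the Gaussian model.
mu = tf.Variable(0.0, dtype=tf.float64)
sigma = tf.Variable(1.0, dtype=tf.float64)

# tf.function compiles the Python function into an optimised computation graph.
@tf.function
def nll():
    # Negative log-likelihood of a Gaussian, written as vectorised tensor operations.
    z = (data - mu) / sigma
    return tf.reduce_sum(0.5 * z ** 2 + tf.math.log(sigma) + 0.5 * np.log(2 * np.pi))

# Gradients with respect to the parameters are obtained by automatic differentiation
# and can be passed to any minimiser.
with tf.GradientTape() as tape:
    loss = nll()
gradients = tape.gradient(loss, [mu, sigma])
\end{verbatim}
A plain NumPy implementation would yield the same loss value but not the gradients; automatic differentiation together with the option to execute the same graph on a GPU is what makes such frameworks attractive as fitting backends.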
The following Section \ref{sec:modelfitting} will expand on model fitting in HEP, including a discussion of already existing libraries. With this knowledge, the usage of \zfit{} and its basic concepts, the formalisation of the model fitting workflow, and the choice of the backend and its capabilities are outlined in Sec. \ref{sec:quickstart}. The individual components of \zfit{} are discussed in more detail in Sec. \ref{sec:parts}. Afterwards, the performance and scalability are evaluated with examples in Sec. \ref{sec:performance}. The extension of \zfit{} beyond the default model fitting is discussed in Sec. \ref{sec:beyond standard fitting} by using its capabilities to implement an amplitude fit. Lastly, a brief overview of the future plans for the \zfit{} library and its ecosystem is given in Sec. \ref{sec:conclusion}.