\section{Model fitting} \label{sec:modelfitting} \label{sec:theory}

In HEP, observations can be quantified by mathematical models which originate from hypotheses or theories and make assumptions about the underlying behaviour of nature. Often, these models have free parameters that we want to measure. A single model may describe only part of the observations, and combinations and compositions of models may be needed to build a model that describes the full data sample. ``Model fitting'' refers to creating these models in a convenient and correct way and finding the parameter values that maximise the agreement with the data.

\subsection{Maximum Likelihood}

At the very heart of model fitting is the need to quantify the agreement, or rather the disagreement, of a model with the data. This function of the parameters and the data is known as the loss. It is the very definition of the problem and mathematically fully defines the solution. In HEP analyses, losses are mostly based on the likelihood of the model given the data, whereby the model typically depends on free parameters. In the following, an introduction to the method of maximum likelihood is given. A more detailed explanation and derivation can be found in Appendix~\ref{appendix:likelihood}.

A likelihood can be defined as follows: given a model parametrised by $\theta$ and a dataset $x$, the likelihood corresponds to the probability of observing $x$ under $\theta$,
\begin{equation} \label{eq:likelihood}
	\mathcal{L}(\theta) = P(x | \theta).
\end{equation}
The likelihood as shown in Eq.~\ref{eq:likelihood} is the quantity to be maximised in order to achieve the maximal $P(\theta | x)$. To build this likelihood, we need the model $f_{\theta}(x)$ to be a probability density function (PDF), i.e. it is normalised to $1$. Especially in HEP, it is often the case that the PDF is zero outside of certain boundaries, for example because points outside a specified domain are removed, in which case
\begin{equation} \label{eq:pdf}
	\int_{l}^{u} f_{\theta}(x) \,\mathrm{d}x = 1,
\end{equation}
where $l$ and $u$ denote the lower and upper boundaries of the domain, respectively. This also extends to higher dimensions. It follows directly that any function $g_{\theta}(x)$\footnote{This concerns only the small subset of modelling functions used in physics, \textit{without} claiming mathematical correctness in full generality. It includes functions $f: \mathbb{R}^n \rightarrow \mathbb{R}$ that are positive, $L^1$ and (piecewise) $C^1$.} can be normalised and therefore used as a PDF $f_{\theta}(x)$
\begin{equation} \label{eq:pdf from func}
	f_{\theta}(x) = \frac{g_{\theta}(x)}{\int_{l}^{u} g_{\theta}(x) \,\mathrm{d}x}.
\end{equation}

A likelihood can be a product of likelihoods of independent events
\begin{equation} \label{eq:likelihood_from_products}
	\mathcal{L} = \prod_{i} \mathcal{L}_i,
\end{equation}
and therefore the likelihood of the dataset $x$ can be written as the joint probability of all events
\begin{equation*} \label{eq:likelihood joint probability}
	\mathcal{L}(x | \theta) = \prod_{i} f_{\theta}(x_i),
\end{equation*}
with $x_i$ a single event from the dataset $x$. The calculation of $\mathcal{L}(x | \theta)$ involves the product of many small numbers, which underflows the limited floating-point precision of a computer. To solve this issue, a log transformation can be applied.
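As a concrete illustration of Eq.~\ref{eq:pdf from func} and of the numerical need for the log transformation, the following minimal sketch (in Python with NumPy and SciPy; the exponential shape, the domain and all numerical values are arbitrary choices made here for illustration) normalises a shape function into a PDF and compares the raw product of per-event probabilities with the sum of their logarithms.
\begin{verbatim}
import numpy as np
from scipy.integrate import quad

rng = np.random.default_rng(seed=0)

# Arbitrary example shape: an exponential with parameter tau,
# restricted to the domain [l, u].
l, u, tau = 0.0, 10.0, 2.0
def g(x):
    return np.exp(-x / tau)

# Normalise g over [l, u] so that it integrates to one, turning it into a PDF.
norm_const, _ = quad(g, l, u)
def pdf(x):
    return g(x) / norm_const

# Toy dataset: events drawn from the same exponential, kept inside [l, u].
x = rng.exponential(scale=tau, size=100_000)
x = x[(x >= l) & (x <= u)]

# The raw product of many numbers smaller than one underflows to zero,
# while the sum of logarithms remains a well-behaved finite number.
print(np.prod(pdf(x)))         # 0.0 (underflow)
print(np.sum(np.log(pdf(x))))  # finite negative log-likelihood
\end{verbatim}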
In addition, the log-likelihood is usually negated, turning the maximisation into a minimisation and resulting in the negative log-likelihood (NLL). A maximum likelihood estimate using the transformation above is therefore obtained by finding the minimum of the NLL
\begin{align} \label{eq:nll}
	\mathrm{NLL} = - \sum_{i} \ln f_{\theta}(x_i).
\end{align}
This therefore maximises the agreement between data and model, i.e. the \textit{probability of the model given the data}. As seen in Eq.~\ref{eq:likelihood_from_products}, the combination of likelihoods is quite versatile and not limited to a single model shape matching the data shape. Often, a combination of several of the following likelihoods is built, as sketched below.
\begin{description}
	\item[Simultaneous] Multiple models can share parameters. To fit them simultaneously to different datasets, their likelihoods can be combined (multiplied, i.e. their NLLs summed).
	\item[Extended] While a PDF is normalised, an absolute scale can be included as an additional term in the likelihood to reflect the number of events contained in this model. Given the data, we know the number of observed events and can add a Poisson term to account for them.
	\item[Prior] For some parameters, a prior distribution is known. This describes the knowledge obtained from other measurements and influences the likelihood if the spread of the prior is of the same order of magnitude as the sensitivity of the fit to the parameter. A prior, or constraint, is a probability depending directly on the parameter value and can be multiplied into the likelihood, i.e. added as a term to the NLL.
\end{description}
Regardless of the complexity of the model, we end up with a single number, the loss, that can be used to compare the agreement between different models or parametrisations and the data. When fitting a model, the loss is minimised by adjusting the parameters. While the absolute value of the loss is usually not important, the ratio of losses from different models is often useful in further statistical tests.
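To make these ingredients concrete, the following sketch (Python with NumPy and SciPy; the Gaussian shape, the constraint on the width and all numerical values are invented purely for illustration) builds an NLL with an extended Poisson term and a Gaussian prior (constraint) on one parameter, and minimises it with \texttt{scipy.optimize.minimize}.
\begin{verbatim}
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm, poisson

rng = np.random.default_rng(seed=1)

# Toy dataset: 950 events from a Gaussian (values invented for illustration).
data = rng.normal(loc=5.0, scale=0.5, size=950)

# External knowledge on the width, e.g. from another measurement: 0.5 +- 0.05.
sigma_prior, sigma_prior_err = 0.5, 0.05

def nll(params):
    mu, sigma, n_expected = params
    if sigma <= 0 or n_expected <= 0:
        return np.inf  # keep the minimiser inside the allowed region
    # Shape part: minus the sum of log(pdf) over all events (cf. Eq. "nll").
    shape = -np.sum(norm.logpdf(data, loc=mu, scale=sigma))
    # Extended part: Poisson term for the observed number of events.
    extended = -poisson.logpmf(len(data), mu=n_expected)
    # Prior/constraint part: Gaussian penalty on sigma.
    constraint = -norm.logpdf(sigma, loc=sigma_prior, scale=sigma_prior_err)
    return shape + extended + constraint

result = minimize(nll, x0=[4.5, 0.4, 900.0], method="Nelder-Mead")
print(result.x)  # fitted values of mu, sigma and the expected yield
\end{verbatim}
A simultaneous fit would follow the same pattern: the NLLs of the models sharing parameters, each evaluated on its own dataset, are simply summed before minimisation.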