\section{Implementation} \subsection{Spaces definition} \label{appendix:spaces defined} A \zspace{} is initialized either through observables \textit{or} through axes and may also carry limits. It \textit{can} hold both observables and axes, in which case an order-based, bijective mapping between the two is defined. In general, a \zspace{} is immutable: adding or changing anything always returns a copy. When a user creates a \zspace{}, observables are used, which defines the coordinate system. Once a dimensional object, such as a model or data, is created with a \zspace{}, the order of the observables matters. Since the \zspace{} at this point only has observables and no axes yet, the axes are created when the dimensional object is instantiated by filling up from zero to $n_{obs} - 1$. This step is crucial: it defines the mapping of the internal axes of this dimensional object to the externally used observables. In other words, every dimensional object has \textit{implicitly} defined axes obtained by counting up to $n_{obs} - 1$; assigning observables creates the mapping by enumerating them. As an example, assume a model was instantiated with a \zspace{} consisting of some observables and data with the same observables but in a different order. The assignment of observables to the model and to the data columns is now fixed, and it is therefore well defined how the data has to be reordered when it is passed to the model. While data and a model were used in the example above, the same holds for limits, which can be used to specify the bounds of integration and more. Since limits can be part of a \zspace{}, they are reordered automatically if the order of the observables or axes is changed, not in-place but in the returned copy. To help with the accounting of dimensions, a \zspace{} can return a copy of itself with differently ordered observables. Internally, the axes, if given, and the limits, if given, are reordered as well. This is crucial for the input preprocessing of any dimensional object, since it ensures that each \zspace{} is ordered in the same way as the observables of that object. A subspace, containing only a subset of the dimensions, can be created from a \zspace{}. It can be used, for example, if a model is composed of lower dimensional models, which is often the case for functors such as a product PDF as described in Sec. \ref{model:functors}.
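To illustrate this behaviour, the following minimal sketch creates a two-dimensional \zspace{}, reorders it and extracts a subspace; it only uses the \pyth{with_obs} method, which is shown again in the next subsection.
\begin{center} \begin{minipage}{\textwidth} \begin{python}
import zfit

# two observables; their order defines the coordinate system
space_xy = zfit.Space(obs=["x", "y"])

# reordering returns a copy and leaves space_xy untouched
space_yx = space_xy.with_obs(["y", "x"])  # observables are now ordered as ("y", "x")

# a subspace containing only a subset of the dimensions
space_x = space_xy.with_obs("x")
\end{python} \end{minipage} \end{center}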
\subsection{General limits} \label{appendix:general limits} Simple limits are just tuples. A more general format is needed, though, to express multiple limits and higher dimensions in a straightforward way. Multiple limits are technically expressed as multiple tuples of lower and upper limits for each observable. The \zspace{} is handled as \textit{one} domain: the area of a \zspace{} is the sum of the areas of each simple limit, the integral over a \zspace{} is the sum of the integrals over each simple limit, and so on. Multiple limits are defined separately and are not built from the projections of all limits. The format is therefore to specify the lower limits and the upper limits in each dimension as a tuple, where multiple limits contain multiple lower limit tuples and multiple upper limit tuples.
As an example, a \zspace{} is created with the observables \texttt{x, y} and the two limits $l_1$ and $l_2$ \begin{align*} l_1 &= (x_0, y_0) ~\text{to}~ (x_1, y_1) \\ l_2 &= (x_2, y_2) ~\text{to}~ (x_3, y_3) \end{align*} To write the limits in the right format, the lower and upper limits have to be separated and concatenated, so that they look like this \begin{align*} lower &= ((x_0, y_0), (x_2, y_2)) \\ upper &= ((x_1, y_1), (x_3, y_3)) \end{align*} By definition of the format, $lower$ and $upper$ have to have the same length. The limits for the \zspace{} are the tuple $(lower, upper)$. Creating this \zspace{} with \zfit{} is done as follows \newline \begin{center} \begin{minipage}{\textwidth} \begin{python}
lower = ((x0, y0), (x2, y2))
upper = ((x1, y1), (x3, y3))
limits = (lower, upper)
multiple_limits = zfit.Space(obs=["x", "y"], limits=limits)
\end{python} \end{minipage} \end{center} The same format is returned by the property \pyth{Space.limits}. This is a quite general format that covers the needs for rectangular limits. More advanced shapes, however, may be necessary (see also \ref{sec:dalitz}) and will most probably be provided in the future. Since the order of the observables matters and \pyth{limits_xy} and \pyth{limits_yx}, as used in Sec. \ref{sec:spaces and dimensions}, define the same domain (apart from the order of the axes), they can be converted into each other using the \pyth{Space.with_obs} method \begin{center} \begin{minipage}{\textwidth} \begin{python}
limits_xy_resorted = limits_yx.with_obs(limits_xy.obs)
limits_xy == limits_xy_resorted  # -> True
\end{python} \end{minipage} \end{center} Notice that this creates a new \zspace{} and leaves \pyth{limits_yx} untouched. The method can also be used to select only a subspace \begin{center} \begin{minipage}{\textwidth} \centering \begin{python}
limits_x == limits_xy.with_obs("x")  # -> True
\end{python} \end{minipage} \end{center} and therefore go back to lower dimensions. \subsection{Data formats} \label{appendix:data formats} Currently, the following formats can be read by \zdata. \begin{description} \item[ROOT] The standard file format used in HEP analyses. It efficiently stores data samples and is the native format of the \root library. Thanks to the recent development of \texttt{uproot}\cite{software:uproot}, it is possible to load these files in Python \textit{without} an installation of \root. \item[Pandas DataFrame] Pandas\cite{mckinney-proc-scipy-2010} is the most extensive data-analysis container in Python. It provides DataFrames, which offer an extensive set of data analysis tools ranging from plotting to feature creation and selections. It is the de-facto standard in Python and can load a variety of data formats, including hdf5, csv and more. The possibility to load DataFrames directly into \zfit{} is therefore a powerful feature, because it allows any preprocessing to be done in DataFrames. \zdata{} can also be converted \textit{to} DataFrames, which makes it possible, for example, to load a ROOT file into \zdata{}, convert it to a DataFrame, apply some preprocessing steps and load it again into \zdata{} (see the sketch after this list). \item[Numpy] Numpy\cite{software:numpy} is the standard computing library in Python, which has been around for a long time. Several libraries, including TF, are inspired by its API and behaviour design. Numpy arrays are the default way to handle any vectorized data and are also returned by TF as the result of computations. \item[Tensors] \zdata{} can also take a pure Tensor as input.
While this may seem at first glance the obvious thing to do, it is trickier: a Tensor, in contrast to the other data types, is \textit{not} fixed per se, since it is only an instruction to compute a certain quantity. While a constant, for example, behaves straightforwardly and always returns the same value, a \zdata{} initialized with a random Tensor produces different data every time it is called. Therefore, special care has to be taken in this case, on the developer as well as on the user side. \end{description}
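As a brief sketch of the typical loading pattern (the observable name and the toy data are chosen for illustration only and the exact constructor signatures may differ between \zfit{} versions), data can for instance be created from Numpy arrays or Pandas DataFrames and converted back:
\begin{center} \begin{minipage}{\textwidth} \begin{python}
import numpy as np
import pandas as pd
import zfit

obs = zfit.Space(obs="x")

# from a numpy array with one column per observable
array = np.random.normal(size=(1000, 1))
data_np = zfit.Data.from_numpy(obs=obs, array=array)

# from a pandas DataFrame, e.g. after some preprocessing ...
df = pd.DataFrame({"x": np.random.normal(size=1000)})
data_df = zfit.Data.from_pandas(df=df, obs=obs)

# ... and back to a DataFrame for further preprocessing
df_again = data_df.to_pandas()
\end{python} \end{minipage} \end{center}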
\subsection{Data batching} \label{appendix:data batching} Small datasets are internally simply converted to a Tensor and attached to the graph. Large datasets though, which exceed either the memory limit of the computing device or the limit of the graph size (which is 2 GB), need to make use of batched, out-of-core computation. TF has a data handling class \pyth{Dataset}, which provides a performant way to do batched computations. It incorporates the loading of the batches from disk into the graph as several operations. This allows the runtime to split the execution in order to asynchronously load a batch of the data while running the graph on an already loaded batch. Another way of on-the-fly computation can, more generally, be achieved with Tensors, since they can be used to instantiate \zdata{}. For example, \zdata{} has a subclass \texttt{Sampler}, which is specialised in this. It allows to evaluate the Tensor and store its value, so that the Tensor is only re-evaluated when requested. The \texttt{Sampler} acts, for example, as the \zdata{} returned when sampling from a model. This data depends on the model but can be used like normal data, for example to construct a loss. \subsection{Dependency management} \label{appendix:dependency management} The graph built by TF can be fully accessed: the parent operations of any operation are available. This makes it possible to detect any dependency by walking along the graph, which is what TF does internally to create the gradient. Using this in general can be risky though, since for example caching with a \pyth{Variable} changes how the \textit{actual} graph looks. Furthermore, it is also time consuming on a larger scale. Inside \zfit{}, it is used as an additional feature to automatically figure out the dependencies of certain subgraphs. There is one type of independents that can change, as also described in \ref{sec:parametrization}, the \texttt{Parameter}, that other objects can depend on, directly or indirectly. To have a dependency structure that is independent of the graph, \zfit{} has a \pyth{BaseDependentsMixin}. A subclass implements a method returning the dependents for itself. This can be done recursively up to the independent parameters (see also \ref{sec:parametrization}), which return themselves. Any major base class implements the appropriate functions but requires, for example, a distinction between a model that depends only on parameters and one that also depends on other models. Both of them are \texttt{Dependents}, but the model has to be aware that it needs to extract dependents not only from the parameters but also from the models. To get the dependents from any object, \pyth{get_dependents} can be used, which returns a set of independent \texttt{Parameters}. This is fundamentally different from the \pyth{params} that each model has. The former returns all the \textit{independent} parameters that the model depends on, which can be any number of them. The latter returns the exact parameters that are used in the function defining the shape of the model. For fitting, when tuning the parameters, \pyth{get_dependents} should be used, since the actually changeable parameters matter. When reading off a value from a model, like the mean by accessing \texttt{mu}, \pyth{params} has to be used. Using the latter for fitting can easily result in an error if not all of the parameters are independent, since the value of dependent (including constant) parameters \textit{cannot be set}. \subsection{Base Model} \label{appendix:basemodel} Of all the classes, the \texttt{BaseModel}, and therefore also its subclasses \texttt{BaseFunc} and \texttt{BasePDF}, contains the most logic. The implementation has a few peculiarities that will be highlighted here. It is meant to be used as a base to implement custom models, providing flexibility and ease of use at the same time, as discussed in Sec. \ref{sec:concepts}. Therefore, this class provides a structure where everything can be directly controlled \textit{but does not have to be}: it takes care of anything that is \textit{unambiguous} but maybe cumbersome to do, while leaving full control to the user. The intended usage as a base class leaves it to users to implement themselves any method they want to control directly. Namely, the class guarantees that any of the main methods can be overridden by changing the implementation of \texttt{\_method}, the same method name but with a leading underscore. Furthermore, any direct control can always be given back to the class by simply raising a \texttt{NotImplementedError}; the class then acts as if the overridden method was never called. In addition, the base class takes care of some unambiguous handling of arguments (see below, e.g. \ref{sec:norm_range_handling}, \ref{sec:multiple_limits_handling}). It is mandatory to decorate each \texttt{\_method} with the \texttt{supports()} decorator. This allows to specify whether the method can handle certain things, like normalization ranges. By default, the more advanced arguments are filtered out and handled automatically; the decorator is meant to provide a way to still allow specific implementations and workarounds.
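To make this concrete, a minimal sketch of a custom PDF that overrides \pyth{_unnormalized_pdf} is shown below. It is meant only as an illustration of the pattern: the constructor arguments and the \pyth{unstack_x} call reflect the typical API but are assumptions that may differ in detail between versions.
\begin{center} \begin{minipage}{\textwidth} \begin{python}
import tensorflow as tf
import zfit


class MyGauss(zfit.pdf.BasePDF):
    """Gaussian shape; normalization and integration are left to the base class."""

    def __init__(self, mu, sigma, obs, name="MyGauss"):
        params = {"mu": mu, "sigma": sigma}
        super().__init__(obs=obs, params=params, name=name)

    def _unnormalized_pdf(self, x):
        x = x.unstack_x()  # assumed helper: returns the single data column as a Tensor
        mu = self.params["mu"]
        sigma = self.params["sigma"]
        return tf.exp(-0.5 * tf.square((x - mu) / sigma))
\end{python} \end{minipage} \end{center}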
\subsubsection{Public methods} \label{sec:public_methods} The internal logic of the methods \pyth{pdf} (and, to some extent, \pyth{func} and \pyth{unnormalized_pdf}), \pyth{sample} and all the \pyth{integrate} methods has a very similar layout. It is split into a public part and more internal methods. We start with the outermost, public methods and follow the subsequent method calls. There is a strict order on which method is called after which, and we follow the same order here. A public method is a method starting without an underscore. It serves as the entry point to a model, asking it to do something. It is supposed to provide a clean API to the user as well as appropriate documentation. The functional responsibility of the public method is to clean the input, which mostly means taking care of the ordering of dimensions and automatically converting certain inputs to a more general format. The following cleaning is done to the input \begin{description} \item[limits] Limits for norm\_range or for sampling and integration can be given as arguments to certain methods. First, the limits are automatically converted to a \zspace{} \textit{if it is safe to do so}: namely, a simple limit consisting of a tuple in the one-dimensional case is converted; anything else raises an error. Second, the method takes care of sorting the space in the right way. Since each model has a \zspace{} with observables assigned to it in a given order, the given or auto-converted \zspace{} is sorted accordingly. This ensures that the limits are internally in the right order. \item[data] As for the limits, the data is first converted to the right format, a \zdata{}, \textit{if possible}. This converts any input data that coincides with the dimensionality of the model. While this is convenient, since it allows numpy arrays and Tensors to be fed directly to a model, it relies on the correct order of both the data and the model observables. Since this can silently lead to mistakes through a wrong order, it will probably not be possible anymore in the future, given that a simple conversion to \zdata{} can always be made. As a second step, the \zdata{} is sorted according to the observables of the model, just like the limits. This ordering is done using context managers that revert the reordering once the data exits the method. \end{description} The ordering is a crucial element to allow the direct usage of any object inside a model while guaranteeing a matching order. \subsubsection{Hooks} After the public method, two hooks follow. Assuming a public method named \texttt{method}, the first hook is named \texttt{\_single\_hook\_method}. Its purpose is to be called directly from the public method \textit{or} whenever the model itself needs to call one of its own methods. The second hook, called by the first, is \texttt{\_hook\_method} and is used for repeated calls by the very same method that was called. This is useful if, for example, a scaling factor is supposed to be applied at the end of the calculations: if a method calls itself recursively, the scaling should be applied just once at the very end. The general idea of hooks is to provide a convenient way to change the behaviour of a method \textit{without} altering the public method and its input cleaning. It gives the user the possibility to change any input right before it is passed further down or any output right before it is returned. Any kind of advanced, model-specific pre- or post-processing or setting of internal flags can therefore be handled here. Some implementations inside \zfit{} make use of this. For example, the implementation of the exponential shape uses the exponential-shift trick\footnote{A normalized exponential shape is invariant under translation. This can be used to increase the numerical stability by avoiding calculations with large numbers and keeping the values around zero.}, which requires determining the shift before the actual computation method is called. Whatever modification is applied in a hook, it is important that any hook always calls its parent method to allow stacking hooks. \subsubsection{Norm range handling} \label{sec:norm_range_handling} Following the hooks, \texttt{\_norm\_method} is invoked. It only exists if a normalization range is used inside the function. This method's responsibility is to automatically take care of the \textit{normalization logic} if the underlying function does not handle it itself. A \texttt{NormRangeNotImplementedError} can be raised by any deeper nested function and is caught here.
For example, in the case of an integral, the method first calls the subsequent method and, if it catches an error, splits the computation into two calls: it calculates two integrals, each \textit{without} a normalization range, one over the limits to be integrated over and one over the normalization range. The first is then divided by the second, which is the very definition of a normalized integral. To illustrate this behaviour in pseudo code, a function \pyth{integrate} that takes the limits and the normalization range as arguments is assumed to exist. \pyth{False} as argument for the normalization range means that the calculated integral is unnormalized. \begin{center} \begin{minipage}{\textwidth} \begin{python}
try:
    integral = integrate(limits, norm_range)  # pseudo method that integrates
except NormRangeNotImplementedError:
    integral_unnormalized = integrate(limits, False)
    normalization = integrate(norm_range, False)  # integrate over norm_range
    integral = integral_unnormalized / normalization
\end{python} \end{minipage} \end{center} This allows a user to implement only a simple integral, when overwriting or registering, without the need to worry about its normalization. For most integrals this would anyway end up calling the integration function twice, basically what is done above. The default behaviour is that the normalization range is handled automatically, as described in Appendix \ref{appendix:basemodel}. But it still leaves room for implementing a method that handles the integral as well as its normalization. There are special cases where this can be achieved with fewer computations than calculating the integral twice, such as when the normalization range equals the limits. \subsubsection{Multiple limits handling} \label{sec:multiple_limits_handling} Next in the call sequence, the method \pyth{_limits_method} is invoked. It has the responsibility of handling multiple limits, which are described in \ref{sec:multiple limits}. Analogously to the way norm\_range is handled in \ref{sec:norm_range_handling}, multiple limits are caught here as well. Subsequent methods are called inside a \texttt{try-except} block in order to catch any \texttt{MultipleLimitsNotImplementedError}. If handling multiple limits is well defined, \pyth{_limits_method} is supposed to take care of it. As an example, the implementation of the integration over a \zspace{} with \textit{n} limits is to split it into \textit{n} independent \zspace{}s, each with only one limit, and to call the following methods with these new \zspace{}s. It is then a simple matter of summing the results. Given some multiple limits as a \zspace{} to be integrated over by invoking the pseudo \pyth{integrate} function, the implementation looks like this \begin{center} \begin{minipage}{\textwidth} \begin{python}
try:
    integral = integrate(limits)
except MultipleLimitsNotImplementedError:
    integral = 0
    for limit in limits.iter_limits():
        integral += integrate(limit)
\end{python} \end{minipage} \end{center} \subsubsection{Most efficient method} In \pyth{_call_method}, the actual functions are invoked. There are usually several choices of which function to invoke, depending on availability and efficiency. The order therefore starts with the most efficient implementation; if that raises a \pyth{NotImplementedError}, the next method is called. \begin{itemize} \item First, the \pyth{_method} is called and its result returned if successful.
This is a method that by default raises a \pyth{NotImplementedError} but can be overwritten by the user. In that case it is guaranteed to be executed, making the public \pyth{method} call seem like a call to \pyth{_method}, modulo input cleaning and limit handling. This guarantees the full level of flexibility in a model: any major function can be completely overwritten by this procedure. \item If no explicit method is implemented, closely related alternatives are invoked. For example, if \pyth{log_pdf} was called and \pyth{_log_pdf} is not overwritten, \pyth{_pdf} is invoked and, if implemented, its logarithm is returned. As another example, if \pyth{integrate} was called and \pyth{_integrate} is not implemented, an analytic integral calculation is performed. If that is not available and also raises an error, more expensive alternatives are called. \item The freedom of \pyth{_call_method} is to try simple, but possibly failing, alternatives to return the desired result. When all of the above fails, the \pyth{_fallback_method} is invoked. This is the last resort and may be quite a complex function. The calculations it performs may be expensive, but the method is guaranteed to work: it must not fail if the \zmodel{} is implemented correctly. \end{itemize} \subsubsection{Functors} \label{model:functors} \label{appendix:functor} To implement a function that depends only on data, the normal base class described above is sufficient. But a model can also depend on other models. Creating compositions, such as the \pyth{SumPDF} from Sec. \ref{sec:quickstart}, requires an additional tweak regarding the dependencies: a functor does not only depend on its own parameters but also on the parameters of the sub-models associated with it. This is automatically taken care of by a functor base class. Compared to normal models, functors often do not need to define their own observables, since these are inferred from the sub-models. \subsection{Sampling techniques} \label{sec:sampling techniques} \label{appendix:sampling techniques} Implementing sampling from an arbitrary distribution is not a straightforward task. The generation of uniformly distributed numbers is comparably simple and is used as the basis for any more advanced sampling technique. Two major approaches are commonly used to sample from a distribution. If the inverse of the analytic integral function is known, it can be used to transform a uniform distribution into the desired distribution by treating the uniform values as samples on the $y$ axis; the inverse then returns values distributed according to the target distribution. This method is very fast and efficient since every drawn event is used. The downside is that it requires the inverse analytic integral, which is not available for most shapes, especially custom ones. For a model where only the shape and no integral is known, the accept-reject method can be used. Samples are randomly generated in the model domain and evaluated. For each event, a random number is drawn between 0 and the maximum of the target shape. If the value returned by the model is larger than the random number, the event is accepted; otherwise it is rejected. The technique is illustrated in Fig. \ref{fig:accept_reject}.
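A minimal, self-contained sketch of this plain accept-reject step, written with Numpy only and using a hypothetical vectorized \pyth{shape} function for the (unnormalized) model shape, could look as follows.
\begin{center} \begin{minipage}{\textwidth} \begin{python}
import numpy as np


def accept_reject_sample(shape, lower, upper, shape_max, n):
    """Sample n events from the unnormalized 1D shape on [lower, upper]."""
    accepted = []
    while len(accepted) < n:
        x = np.random.uniform(lower, upper, size=n)    # proposed events
        y = np.random.uniform(0., shape_max, size=n)   # random y values
        accepted.extend(x[shape(x) > y])               # accept if the shape lies above y
    return np.array(accepted[:n])
\end{python} \end{minipage} \end{center}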
\begin{figure}[tbp] \begin{subfigure}{.5\textwidth} \includegraphics[width=\textwidth]{figs/accept_reject.png} \caption{} \end{subfigure}% \begin{subfigure}{.5\textwidth} \includegraphics[width=\textwidth]{figs/accept_reject_importance.png} \caption{} \label{fig:imporance_accept_reject} \end{subfigure}% \caption{Visualization of the accept-reject method. Proposed events are randomly sampled in the valid range. In a), a uniformly sampled $y$ value is used to either accept or reject them; in b), a Gaussian-shaped $y$ is used. The black line is the true shape of the model. The orange line represents the distribution the $y$ values were drawn from. Blue values are accepted, red ones are rejected.} \label{fig:accept_reject} \end{figure} For peaky distributions, however, this can be very inefficient, since all red (rejected) values are lost. An increase in efficiency can be achieved by sampling from a distribution that follows the target shape more closely, so-called ``importance sampling'', as shown in Fig.~\ref{fig:accept_reject} on the right. This requires a few more ingredients, namely \begin{description} \item[target] The distribution we want to approximate. \item[probability] The target probability is the function value of the target shape at a given $x$. While called a probability, it does not have to be normalized to anything but only has to be proportional to a real probability. \item[proposal] The proposed sample consists of the events that will be either accepted or rejected. They are drawn from the proposal distribution. \item[weights] The weight is the probability (or a value proportional to it) of an event of the proposal sample being drawn from the proposal distribution. \item[rnd] A random number drawn uniformly between 0 and 1 to decide whether to accept or reject an event. \end{description} To approximate the target, a sample is drawn from the proposal. To accept or reject an event $i$, the following is checked \begin{equation} accept_i = prob_i > weight_i \cdot rnd_i \end{equation} This is only unbiased if $prob_i < weight_i$ for all $i$; otherwise the distribution is distorted, as shown in Fig.~\ref{fig:importance_biased}. \begin{figure}[tbp] \centering \includegraphics[width=0.5\textwidth]{figs/importance_gone_wrong.png} \caption{Importance sampling with a wrongly scaled weight. The sampled Gaussian distribution (blue) is cut off at the top and does not resemble the correct shape.} \label{fig:importance_biased} \end{figure} To be unbiased, the weights therefore have to be scaled up sufficiently. This can, however, lead to problems if the weight is significantly lower than the target probability, even if only at a single point: the required large rescaling renders the sampling of the rest of the target inefficient. It is therefore important that the proposal distribution matches the target reasonably well; most importantly, the maximum and the minimum of the ratio of the two distributions should be as close as possible. \subsection{Loss defined} \label{appendix:loss defined} The following parts are provided by a loss to enable minimisation; a short usage sketch is given at the end of this section. \begin{description} \item[value] The actual value of the loss is needed since it is the object to be minimised. The method \pyth{value} returns the Tensor that can be run with \pyth{zfit.run(...)}. This Tensor typically contains the heavy computations. \item[parameters] The loss depends on one or more floating parameters. They are associated with models and change their shape. As for a model, all dependents can be retrieved using \pyth{get_dependents}. Changing their values affects the value of the loss. \item[gradients] Calling \pyth{gradients} returns the gradients of the loss with respect to the parameters. Gradients provide a helpful tool for the minimisation algorithm. \end{description}
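To illustrate these three ingredients, the following minimal sketch builds an unbinned negative log-likelihood and queries its value, its dependents and its gradients. The toy model and data are chosen for illustration only, and the exact method signatures may differ between \zfit{} versions.
\begin{center} \begin{minipage}{\textwidth} \begin{python}
import numpy as np
import zfit

obs = zfit.Space(obs="x", limits=(-5, 5))
mu = zfit.Parameter("mu", 0.2)
sigma = zfit.Parameter("sigma", 1.1)
gauss = zfit.pdf.Gauss(mu=mu, sigma=sigma, obs=obs)
data = zfit.Data.from_numpy(obs=obs, array=np.random.normal(0., 1., size=(1000, 1)))

nll = zfit.loss.UnbinnedNLL(model=gauss, data=data)

value = zfit.run(nll.value())     # the Tensor holding the heavy computation
params = nll.get_dependents()     # the set of independent Parameters
gradients = zfit.run(nll.gradients(list(params)))
\end{python} \end{minipage} \end{center}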