\section{Machine learning}

\subsection{Introduction}
\label{ML_Intro}

Machine learning has already proven very successful at solving problems in numerous areas of science as well as in the private sector. Based on these promising results, scientists are eager to study its potential in physics.\\
Machine learning comprises several subfields. In this thesis, we will focus mainly on neural networks (NN), with special attention to recurrent neural networks (RNN) \cite{connor1994recurrent, grossberg2013recurrent}, and on XGBoost (XGB) \cite{ML:XGBoost} models built on boosted decision trees.

\subsection{Artificial neural networks}

\subsubsection{General concepts}

The fundamental concept behind artificial neural networks is to imitate the architecture of the human brain. They can be used for classification as well as regression problems. In its simplest form, a neural network can be thought of as a mapping from some input to some target. For this thesis, two networks of a special subtype of neural networks, called recurrent neural networks, were used. All of the networks used in this thesis were written with the Python library Keras \cite{chollet2015keras} on a TensorFlow \cite{abadi2016tensorflow} backend. In this section, the basic principles of neural networks are explained.\\
A neural network consists of many neurons organized in layers, as seen in figure \ref{neural_network_arch}. Each neuron is connected to every neuron in the neighbouring layers, and each of these connections has a specific weight assigned to it.\\
In its most basic form, each neuron computes a weighted sum of all its inputs and adds a bias to it. In addition, each neuron has an activation function, which is applied at the end of the calculation (see also figure \ref{neuron}):
\begin{equation}
y = f_{activation} \left(\sum^n_{i=1}(x_i\cdot w_i) + b\right)
\end{equation}
The activation function introduces non-linearity into the system. Some more complex neuron architectures will be presented later.\\
The first layer, also called the input layer, is always defined by the number of inputs, with one dimension for each input. The dimensions of the following layers (excluding the last one), which are called the hidden layers, can be chosen arbitrarily. The dimension of the last layer, also called the output layer, is determined by the dimensionality of the prediction. The number of hidden layers, and their corresponding dimensions, changes the performance of the system.
\begin{figure}[H]
\begin{center}
\begin{subfigure}{0.8\textwidth}
\includegraphics[width=1\textwidth]{img/neural_network.png}
\caption{Architecture of a neural network}
\label{neural_network_arch}
\end{subfigure}
\begin{subfigure}{0.5\textwidth}
\includegraphics[width=0.8\textwidth]{img/neuron.png}
\caption{Neuron}
\label{neuron}
\end{subfigure}
\caption{Neural network architecture}
\end{center}
\end{figure}
There is no a priori way of knowing how many layers and dimensions will give the best performance; only the general effects of modifying them can be described. Increasing the number of layers enables the system to solve more complex problems, while more dimensions make the system more flexible. However, even these general guidelines are to be applied with caution. For example, adding too many layers can make training exceedingly slow, while adding too many neurons combined with a too small training set can result in overfitting\footnote{When a system performs well on the training set but poorly on the test set}. Depending on the problem and the data given, each task has its own optimal configuration. With growing experience with NNs, one can make better initial guesses; in the end, however, finding the optimal configuration always comes down to some form of systematic trial and error. A minimal sketch of such a network is shown below.
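As an illustration of these concepts, the following is a minimal sketch of how such a fully connected network could be defined in Keras. The layer sizes, input dimension and activation functions are chosen purely for illustration and do not correspond to the networks used in this thesis.
\begin{verbatim}
from tensorflow import keras

# Hypothetical example: 10 input features, two hidden layers,
# one regression output. All sizes are illustrative only.
model = keras.Sequential([
    keras.layers.Dense(32, activation='relu',
                       input_shape=(10,)),     # input -> first hidden layer
    keras.layers.Dense(32, activation='relu'), # second hidden layer
    keras.layers.Dense(1)                      # output layer, one prediction
])
\end{verbatim}
Each \texttt{Dense} layer implements exactly the weighted sum, bias and activation function described above.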
\subsubsection{Activation functions}

The two RNNs used in this thesis use $selu$ \cite{klambauer2017self}, $tanh$ and $relu$ activation functions.
\begin{align}
selu(x) &= \lambda
\begin{cases}
x, & \text{if $x>0$}\\
\alpha e^x - \alpha, & \text{otherwise}
\end{cases}\\
relu(x) &= \max(0, x)
\end{align}
\begin{figure}[H]
\begin{center}
\begin{subfigure}{0.45\textwidth}
\includegraphics[width=1\textwidth]{img/selu.png}
\caption{Selu, elu activation function}
\label{selu}
\end{subfigure}
\begin{subfigure}{0.45\textwidth}
\includegraphics[width=1\textwidth]{img/relu.png}
\caption{Relu activation function}
\label{relu}
\end{subfigure}
\begin{subfigure}{0.45\textwidth}
\includegraphics[width=1\textwidth]{img/tanh.png}
\caption{Tanh activation function}
\label{tanh}
\end{subfigure}
\caption{Activation functions}
\end{center}
\end{figure}
Here $\lambda$ and $\alpha$ are fixed parameters\footnote{For standard scaled inputs (mean $= 0$, stddev. $=1.0$): $\alpha \approx 1.6732$, $\lambda \approx 1.0507$}. The $selu$ activation has the advantage of normalizing its output. As a rule of thumb, normalized inputs usually give better results (the outputs of one layer's neurons are the inputs of the next). Using a $tanh$ was the standard approach for a long time, although it has a disadvantage compared with the other activation functions: its slope becomes very small for inputs of large absolute value, which slows down training noticeably.

\subsubsection{Concepts of training}

A neural network is trained with a sample of events. Each event consists of a few input parameters and a training target, which is the value the network is trained to predict. Three important terms for the training of a neural network are epoch, batch size and loss function.\\
An epoch refers to one training iteration in which all of the training samples are used once and the weights and biases are modified to fit the targets better. Usually a system is trained over many epochs, until the weights and biases stay approximately constant at their optimal values.\\
Batch size refers to the number of examples that are given to the system at once during training. The batch size should neither be chosen too small, as small batches train slower, nor too big, as some randomness in each step is desirable. Experience shows that a reasonable batch size usually lies between 10 and 100 examples per batch. It is important to note that decreasing the batch size effectively widens the minimum of the loss landscape we want to find, which makes locating its general area easier. However, if the minimum gets too wide, the slope becomes too small to reach it in a reasonable time. Conversely, by increasing the batch size too much, the minimum gets exceedingly narrow, and it is possible to continuously keep "jumping" over it with every training step performed. A training call with these parameters is sketched below.
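Continuing the illustrative example from above, the number of epochs and the batch size enter the Keras training call as follows; the numbers are again purely illustrative, and \texttt{X\_train} and \texttt{y\_train} stand for an assumed training sample and its targets. The loss function passed to \texttt{compile} is the subject of the next section.
\begin{verbatim}
# X_train: input parameters, y_train: training targets (assumed given)
model.compile(optimizer='adam', loss='mean_squared_error')
model.fit(X_train, y_train,
          epochs=50,      # pass over the full training sample 50 times
          batch_size=32)  # update weights after every 32 examples
\end{verbatim}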
\subsubsection{Loss functions}

To train the system, we need some way to quantify the quality of its predictions. For this we use a loss function: it takes the predicted values of the system and the target values and returns a single number measuring the performance. There are various loss functions; in the two RNNs, "mean squared error" (MSE, formula \ref{MSE}) and "binary crossentropy" (BC, formula \ref{BC}) were used. The goal of every NN is to minimize its loss function.
\begin{align}
L(w,b) &= \frac{1}{n} \sum^n_{i=1} \left(\hat{Y}_i(w_i,b_i) - Y_i\right)^2 \label{MSE}\\
L(w,b) &= - \frac{1}{n} \sum^n_{i=1}\left(Y_i \log(\hat{Y}_i(w_i,b_i))+(1-Y_i) \log(1-\hat{Y}_i(w_i,b_i))\right) \label{BC}
\end{align}
With:
\begin{itemize}
\item $w_i$: the weights of the system
\item $b_i$: the biases of the system
\item $\hat{Y}_i(w_i,b_i)$: the predicted values of the system
\item $Y_i$: the targets for the predictions
\item $L(w,b)$: the loss over $n$ events
\end{itemize}

\subsubsection{Stochastic gradient descent}

Several methods exist to minimize the loss, the simplest being stochastic gradient descent (SGD). In SGD, the gradient of the loss with respect to the weights and biases is calculated and directly applied as an update. By doing this repeatedly, the system eventually ends up in a minimum\footnote{It is also very possible to just get stuck in a local minimum}.

\subsubsection{Stochastic gradient descent with Momentum}

Training algorithms using momentum are essentially an improved version of SGD. To circumvent the problem of getting stuck in a local minimum, the gradient can build up momentum from past gradients. This is done by adding a momentum term, an exponentially decaying average over past gradients, to the changes applied to the weights and biases. This generally trains faster than plain SGD and is less likely to get stuck in local minima.

\subsubsection{RMSProp}

Another improved version of SGD is RMSProp. The RMSProp algorithm scales the learning rate of each individual parameter by an exponentially decaying average of the past squared gradients. This has the effect that the step size increases for parameters whose past gradients were small, and vice versa. Additionally, because the average of the past squared gradients decays exponentially, the step size is prevented from getting too small. Both update rules, together with plain SGD, are sketched below.
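The following sketch illustrates the three update rules on a toy one-dimensional loss; the loss function, learning rate and decay constants are illustrative assumptions.
\begin{verbatim}
import numpy as np

# Toy problem: minimize L(w) = (w - 3)^2, so dL/dw = 2(w - 3)
grad = lambda w: 2.0 * (w - 3.0)
lr = 0.1  # learning rate (illustrative)

# Plain SGD: apply the gradient directly
w = 0.0
for _ in range(100):
    w -= lr * grad(w)

# SGD with Momentum: exponentially decaying average of past gradients
w, v, beta = 0.0, 0.0, 0.9
for _ in range(100):
    v = beta * v + (1 - beta) * grad(w)
    w -= lr * v

# RMSProp: scale the step by the average of past squared gradients
w, s, beta2, eps = 0.0, 0.0, 0.999, 1e-8
for _ in range(100):
    g = grad(w)
    s = beta2 * s + (1 - beta2) * g**2
    w -= lr * g / (np.sqrt(s) + eps)
\end{verbatim}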
\subsubsection{Adam}

The most commonly used algorithm, however, is the Adam \cite{chilimbi2014project} training algorithm, which stands for Adaptive Moment estimation (see formulas \ref{adam_alg}). It is essentially a combination of Momentum and RMSProp and takes the best of both. It is also the algorithm used to train both RNNs of this thesis, as it converges the quickest and most reliably towards the global minimum. The algorithm contains two moments: the first moment is an exponentially decaying average of past gradients, as in Momentum, while the second moment is an exponentially decaying average of past squared gradients, as in RMSProp.
\begin{center}
Initial state:
\begin{equation*}
V_{dW} = 0,\quad S_{dW} = 0,\quad V_{db} = 0,\quad S_{db} = 0
\end{equation*}
On iteration $t$:
\begin{equation}
\begin{split}
V_{dW} = \beta_1 V_{dW} + (1-\beta_1)\, dW,\quad V_{db} = \beta_1 V_{db} + (1-\beta_1)\, db\\
S_{dW} = \beta_2 S_{dW} + (1-\beta_2)\, dW^2,\quad S_{db} = \beta_2 S_{db} + (1-\beta_2)\, db^2 \\
V^{corrected}_{dW} = \frac{V_{dW}}{1-\beta_1^t},\quad V^{corrected}_{db} = \frac{V_{db}}{1-\beta_1^t}\\
S^{corrected}_{dW} = \frac{S_{dW}}{1-\beta_2^t},\quad S^{corrected}_{db} = \frac{S_{db}}{1-\beta_2^t}\\
W_{t+1} = W_{t} - \alpha \frac{V^{corrected}_{dW}}{\sqrt{S^{corrected}_{dW}}+\epsilon}\\
b_{t+1} = b_{t} - \alpha \frac{V^{corrected}_{db}}{\sqrt{S^{corrected}_{db}}+\epsilon}
\label{adam_alg}
\end{split}
\end{equation}
\end{center}
With:
\begin{itemize}
\item $V_{dW}$, $V_{db}$: correspond to the Momentum part
\item $S_{dW}$, $S_{db}$: correspond to the RMSProp part
\item $\epsilon$: a very small constant, only there to prevent division by $0$ (usually $\epsilon = 10^{-8}$)
\item $\alpha$: learning rate (needs to be tuned according to the problem)
\item $\beta_1$: decay constant of the Momentum part (usually $\beta_1 = 0.9$)
\item $\beta_2$: decay constant of the RMSProp part (usually $\beta_2 = 0.999$)
\end{itemize}

\subsubsection{Decaying learning rate}

To counteract the problem of repeatedly "jumping" over the minimum, some NNs also use a decaying learning rate during training. With it, the step size gets smaller with every consecutive step, which should in principle result in the step size converging to zero when reaching the global minimum. Most NNs, including the two RNNs used in this thesis, do not use a decaying learning rate, as the Adam algorithm on its own already performs well enough.

\subsubsection{Batch normalisation}

Another important technique often used in NNs is Batch Normalisation \cite{ioffe2015batch}, \cite{cooijmans2016recurrent}. Batch Normalisation normalizes and centers the inputs around zero between every layer of the NN. It has proven to be a potent technique to make NNs train faster and even perform better.
\begin{figure}[H]
\begin{center}
\includegraphics[width=1\textwidth]{img/batch_norm.jpeg}
\caption{The effects of Batch Normalisation on data}
\label{batch_norm}
\end{center}
\end{figure}
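In Keras, Adam and Batch Normalisation can be used as sketched below; the model layout and all sizes are illustrative assumptions, with the optimizer constants set to the typical values listed above.
\begin{verbatim}
from tensorflow import keras

# Hypothetical model with Batch Normalisation between the layers
model = keras.Sequential([
    keras.layers.Dense(64, activation='relu', input_shape=(10,)),
    keras.layers.BatchNormalization(),   # re-center and normalize
    keras.layers.Dense(64, activation='relu'),
    keras.layers.BatchNormalization(),
    keras.layers.Dense(1, activation='sigmoid')
])

# Adam with the typical constants given above
opt = keras.optimizers.Adam(learning_rate=0.001, beta_1=0.9,
                            beta_2=0.999, epsilon=1e-8)
model.compile(optimizer=opt, loss='binary_crossentropy')
\end{verbatim}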
\subsection{Recurrent Neural Networks}

\subsubsection{General concepts}

Recurrent Neural Networks (RNN) are a subclass of neural networks specialised in dealing with sequential data structures. There are various applications for RNNs, such as speech recognition, music generation, sentiment classification, DNA sampling and so on. Ordinary NNs generally do not perform well on sequential data. One reason is that they do not share features learned across different positions in the data\footnote{In our experiment, the positions of the particles, given as x, y, z coordinates in the detector}. Another problem is that the input and output do not necessarily have the same length every time.\\
Note that in the context of RNNs, the units we previously called neurons are usually called cells.\\
RNNs provide a much better representation of sequential data, which also helps to reduce the number of variables in the system and hence makes training more efficient.
\begin{figure}[H]
\begin{center}
\includegraphics[width=1\textwidth]{img/RNN_general_architecture.png}
\caption{General RNN architecture}
\label{RNN_arch}
\end{center}
\end{figure}
With:
\begin{itemize}
\item $x^{\langle t \rangle}$: input at timestep $t$, with $T_x$ total steps
\item $\hat{y}^{\langle t \rangle}$: output at timestep $t$
\item $a^{\langle 0 \rangle}$: initial value given to the RNN in the first step
\item $a^{\langle t \rangle}$: information passed over from the previous step
\end{itemize}
Figure \ref{RNN_arch} shows the general architecture of an RNN. Every step of the input data ($x^{\langle t \rangle}$) is fed sequentially into the RNN, which then generates some output $\hat{y}^{\langle t \rangle}$ after every step. To share already learned information and features with future steps, $a^{\langle t \rangle}$ is passed down as additional input to the RNN for the next step.

\subsubsection{Most common architectures}

There are two concepts of how the data is fed into the system, and three structures of RNNs depending on the input and output of the system.\\
Usually, the data is fed into the system step by step. For problems where the entire sequence is not yet known at the start, this is the only way to feed the data into the system.\\
If, however, the entire sequence is already known at the beginning, e.g. for sequence classification, the information is often fed into the system from both sides. Networks with this specific architecture are called bidirectional RNNs \cite{schuster1997bidirectional}; this often increases the system's performance.\\
With the first RNN, we wanted to predict particle tracks after leaving the detector, so we could only use a one-directional RNN, as the whole track was not available. The second RNN is a classifier of the tracks; with the whole information available from the start, it was designed as a bidirectional RNN.\\
A system has a "many-to-one" architecture if it takes a sequential input but only the final output of the system matters, e.g. in classification problems. This is the architecture used for both RNNs. By the same reasoning, if the input is sequential and the output generated at each step matters, e.g. in speech recognition, the architecture is called "many-to-many". A "one-to-one" architecture is basically just a regular NN. A sketch of a bidirectional "many-to-one" model follows below.
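A bidirectional "many-to-one" model could be set up in Keras as sketched below, using the LSTM cells described in the next section. The sequence length, number of features and layer sizes are illustrative assumptions, not the configuration of the classifier used in this thesis.
\begin{verbatim}
from tensorflow import keras

# Hypothetical input: sequences of 8 steps with 3 features each
# (e.g. x, y, z positions of a track)
model = keras.Sequential([
    # return_sequences=False keeps only the final output ("many-to-one");
    # Bidirectional feeds the sequence into the layer from both sides
    keras.layers.Bidirectional(
        keras.layers.LSTM(32, return_sequences=False),
        input_shape=(8, 3)),
    keras.layers.Dense(1, activation='sigmoid')  # classification score
])
model.compile(optimizer='adam', loss='binary_crossentropy')
\end{verbatim}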
\subsubsection{Cell types}

Besides the basic RNN cell type, which shall not be discussed in detail in this thesis, the two most influential and successful cell types are Long Short-Term Memory (LSTM) \cite{gers1999learning} cells and Gated Recurrent Units (GRU) \cite{chung2014empirical}. In this thesis, only LSTM cells are explained in greater detail, as they were the only cells used in the RNNs.\\
GRUs were invented with the intention of creating a cell type with performance similar to the LSTM cell but with a simpler internal structure. Being less complex than an LSTM cell, a GRU cell has fewer parameters to modify during training, which also speeds up training.\\
LSTM cells (see figure \ref{LSTM_arch}) have many useful properties, such as a forget gate, an update gate and an output gate. With this cell type, it is easy to pass information down to the following steps without it being altered in a significant way (long-term memory). However, there are also built-in ways to update this passed-down information with new information (short-term memory). Even though GRUs are gaining more and more traction, LSTM cells are still widely considered to be the most successful type of cell.
\begin{figure}[H]
\begin{center}
\includegraphics[width=0.8\textwidth]{img/LSTM_cell.png}
\caption{Architecture of an LSTM cell}
\label{LSTM_arch}
\end{center}
\end{figure}
The math behind the LSTM cell looks as follows\footnote{The notation used is the same as in figure \ref{LSTM_arch}}:
\begin{equation}
\begin{split}
\tilde{c}^{\langle t \rangle} = \tanh(W_c \left[ a^{\langle t-1 \rangle}, x^{\langle t \rangle} \right] + b_c)\\
\Gamma_u = \sigma(W_u \left[ a^{\langle t-1 \rangle}, x^{\langle t \rangle} \right] + b_u)\\
\Gamma_f = \sigma(W_f \left[ a^{\langle t-1 \rangle}, x^{\langle t \rangle} \right] + b_f)\\
\Gamma_o = \sigma(W_o \left[ a^{\langle t-1 \rangle}, x^{\langle t \rangle} \right] + b_o)\\
c^{\langle t \rangle} = \Gamma_u \cdot \tilde{c}^{\langle t \rangle} + \Gamma_f \cdot c^{\langle t - 1 \rangle}\\
a^{\langle t \rangle} = \Gamma_o \cdot \tanh(c^{\langle t \rangle})
\end{split}
\end{equation}
Here $\Gamma_u$, $\Gamma_f$ and $\Gamma_o$ are the update, forget and output gates mentioned above, and $\sigma$ denotes the sigmoid function.

\subsection{XGBoost}

XGBoost \cite{ML:XGBoost} is based on boosted decision trees (extreme gradient boosting). In this approach, the data samples are split using a decision tree. At every step, a new tree is created to account for the errors of the prior models, and the trees are then added together to create the final prediction. A gradient descent algorithm is used to minimize the loss when adding new trees.\\
It is often used as a classifier; however, it can also be used in regression models. In this thesis, an XGBoost classifier was used to determine a baseline and have some comparison for our bidirectional RNN classifier, as sketched below.
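A minimal sketch of how such a classifier could be set up with the XGBoost Python package is given here; the data shapes and hyperparameters are illustrative assumptions, not the configuration of the baseline used in this thesis.
\begin{verbatim}
import numpy as np
import xgboost as xgb

# Hypothetical data: 1000 events with 6 input features each,
# and a binary label per event (e.g. signal vs. background)
X = np.random.rand(1000, 6)
y = np.random.randint(0, 2, size=1000)

# Boosted decision trees: each new tree corrects the errors
# of the ensemble built so far
clf = xgb.XGBClassifier(n_estimators=100, max_depth=4,
                        learning_rate=0.1)
clf.fit(X, y)
scores = clf.predict_proba(X)[:, 1]  # per-event classification score
\end{verbatim}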