\section{Machine learning}

\subsection{Introduction}

Machine learning has already proven to be very successful in solving problems in numerous areas of science as well as in the private sector. Based on these promising results, scientists are eager to study its potential in physics.\\

Machine learning comprises several subfields. In this thesis, we will focus mainly on neural networks (NN), with special attention to recurrent neural networks (RNN) \cite{connor1994recurrent, grossberg2013recurrent}, and on XGBoost (XGB) \cite{ML:XGBoost} models with boosted decision trees.

\subsection{Artificial neural networks}

The fundamental concept behind artificial neural networks is to imitate the architecture of the human brain. They can be used for classification as well as regression problems. In its simplest form, a neural network can be thought of as a mapping from some input to some target. For this thesis, two networks of a special subtype of neural networks, called recurrent neural networks, were used. All of the networks used in this thesis were written in the Python library Keras \cite{chollet2015keras} with a TensorFlow \cite{abadi2016tensorflow} backend. In this section, the basic principles of neural networks will be explained.\\

A neural network consists of many neurons organized in layers, as seen in figure \ref{neural_network_arch}. Each neuron is connected to every neuron in the neighbouring layers, and each of these connections has a specific weight assigned to it.\\
In its most basic form, each neuron calculates a weighted sum of all its inputs and then adds a bias. In addition, each neuron has an activation function, which is applied at the end of the calculation (see also figure \ref{neuron}):

\begin{equation}
y = f_{activation} \left(\sum^n_{i=1}(x_i\cdot w_i) + b\right)
\end{equation}

This is done to create non-linearity in the system. Later, some more complex architectures of neurons will be presented.\\
The first layer, also called the input layer, is always defined by the number of inputs, with one dimension for each input. The dimensions of the following layers (excluding the last one), which are called hidden layers, can be chosen to be an arbitrary number. The dimension of the last layer, also called the output layer, is determined by the dimensionality of the prediction. The number of hidden layers, and their corresponding dimensions, affects the performance of the system.

\begin{figure}[H]
\begin{center}
\begin{subfigure}{0.8\textwidth}
\includegraphics[width=1\textwidth]{img/neural_network.png}
\caption{Architecture of a neural network}
\label{neural_network_arch}
\end{subfigure}
\begin{subfigure}{0.5\textwidth}
\includegraphics[width=0.8\textwidth]{img/neuron.png}
\caption{Neuron}
\label{neuron}
\end{subfigure}
\caption{Neural network architecture}
\end{center}
\end{figure}
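As an illustration, a network like the one in figure \ref{neural_network_arch} could be defined in Keras as follows. This is only a minimal sketch: the layer sizes, the activation functions and the input dimension of 8 are placeholders, not the configuration used in this thesis.

\begin{verbatim}
from keras.models import Sequential
from keras.layers import Dense

model = Sequential()
# input layer: defined implicitly via input_dim (one dimension per input)
model.add(Dense(32, activation='relu', input_dim=8))  # first hidden layer
model.add(Dense(32, activation='relu'))               # second hidden layer
model.add(Dense(1, activation='sigmoid'))             # output layer
\end{verbatim}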

There is no way of knowing in advance how many dimensions and layers will give the best performance; one can only describe the general effects of modifying them. Generally, increasing the number of layers enables the system to solve more complex problems, while more dimensions make the system more flexible. However, even these general guidelines are to be applied with caution. For example, adding too many layers can cause the system to train exceedingly slowly, whilst adding too many neurons with a too small training set can result in overfitting\footnote{When a system performs well on the training set but poorly on the test set}. Depending on the problem and the data given, each task has its own optimal configuration. With growing experience with NN, one can take better guesses where to start. In the end, however, finding the optimal configuration always comes down to some sort of systematic trial and error.\\

The two RNNs used in this thesis use $selu$ \cite{klambauer2017self} as well as $tanh$ and $relu$ as their activation functions.

\begin{align}
selu(x) = \lambda \begin{cases}
    x, & \text{if $x>0$}\\
    \alpha e^x - \alpha, & \text{otherwise}
  \end{cases}\\
relu(x) = \max(0, x)
\end{align}

\begin{figure}[H]
\begin{center}
\begin{subfigure}{0.45\textwidth}
\includegraphics[width=1\textwidth]{img/selu.png}
\caption{Selu activation function}
\label{selu}
\end{subfigure}
\begin{subfigure}{0.45\textwidth}
\includegraphics[width=1\textwidth]{img/relu.png}
\caption{Relu activation function}
\label{relu}
\end{subfigure}
\begin{subfigure}{0.45\textwidth}
\includegraphics[width=1\textwidth]{img/tanh.png}
\caption{Tanh activation function}
\label{tanh}
\end{subfigure}
\caption{Activation functions}
\end{center}
\end{figure}

Where $\lambda$ and $\alpha$ are fixed parameters\footnote{For standard scaled inputs (mean $= 0$, stddev. $=1.0$): $\alpha \approx 1.6732$, $\lambda \approx 1.0507$}. Selus have the advantage of normalizing their outputs. As a rule of thumb, normalized inputs usually tend to give better results (the outputs of the neurons are the inputs of other neurons). Using $tanh$ was the standard approach for a long time, although it has some disadvantages compared to the other activation functions: its slope becomes very small for inputs of large absolute value, which slows down training noticeably.\\
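For illustration, these activation functions can be written in a few lines of Python; this is a sketch using \texttt{numpy} and the parameter values quoted in the footnote.

\begin{verbatim}
import numpy as np

ALPHA, LAMBDA = 1.6732, 1.0507  # values for standard scaled inputs

def selu(x):
    # identity for x > 0, scaled shifted exponential otherwise
    return LAMBDA * np.where(x > 0, x, ALPHA * np.exp(x) - ALPHA)

def relu(x):
    return np.maximum(0.0, x)

# tanh is available directly as np.tanh(x)
\end{verbatim}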

The neural network is trained with a sample of events. Each event consists of a few input parameters and a training target, which is the value the neural network will be trained to predict. Three important terms for the training of a neural network are epoch, batch size and loss function.\\
An epoch refers to one training iteration, in which every training sample is used once and the weights and biases are modified to better fit the desired targets. Usually a system is trained over many epochs, until the weights and biases stay approximately constant at their optimal values.\\
Batch size refers to the number of examples that are given to the system at once during training. The batch size should neither be chosen too small, as small batch sizes train slower, nor too big, since some randomness is wanted. Experience shows that a reasonable batch size usually lies between 10 and 100 examples per batch. It is important to note that decreasing the batch size makes the minimum of the mapping we want to find wider, which makes finding the general area of the minimum easier. However, if the minimum gets too wide, the slope gets too small to reach the minimum in a reasonable time. Conversely, when increasing the batch size too much, the minimum gets exceedingly narrow and it is possible to continuously keep ``jumping'' over the minimum with every training step performed.\\
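In Keras, both quantities are simply arguments to the training call. A minimal sketch, assuming the model defined above and hypothetical training arrays \texttt{x\_train} and \texttt{y\_train} (the loss function and the optimizer passed to \texttt{compile} are discussed below):

\begin{verbatim}
model.compile(optimizer='adam', loss='binary_crossentropy')
# one epoch = one pass over all training samples;
# batch_size = number of samples per weight update
model.fit(x_train, y_train, epochs=100, batch_size=32)
\end{verbatim}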
To train the system, we need some way to quantify the quality of our predictions. For this we use a loss function, which takes the predicted values of the system and the target values and returns a scalar measure of the performance. There are various loss functions; in the two RNNs, ``mean squared error'' (MSE, formula \ref{MSE}) and ``binary crossentropy'' (BC, formula \ref{BC}) were used. The goal of every NN training is to minimize the loss function (a short Python sketch of both losses follows the symbol definitions below).

\begin{align}
L(w,b) = \frac{1}{n} \sum^n_{i=1} \left(\hat{Y}_i(w_i,b_i) - Y_i\right)^2
\label{MSE}\\
L(w,b) = - \frac{1}{n} \sum^n_{i=1}\left(Y_i \log(\hat{Y}_i(w_i,b_i))+(1-Y_i) \log(1-\hat{Y}_i(w_i,b_i))\right)
\label{BC}
\end{align}

With:

\begin{itemize}
\item $w_i$ are the weights of the system
\item $b_i$ are the biases of the system
\item $\hat{Y}_i(w_i,b_i)$ are the predicted values of the system
\item $Y_i$ are the targets for the predictions
\item $L(w,b)$ is the loss over $n$ events
\end{itemize}
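Both loss functions can be sketched compactly in Python with \texttt{numpy}, where \texttt{y\_pred} plays the role of $\hat{Y}$ and \texttt{y\_true} the role of $Y$; the clipping is only an implementation detail to avoid $\log(0)$.

\begin{verbatim}
import numpy as np

def mse(y_true, y_pred):
    return np.mean((y_pred - y_true) ** 2)

def binary_crossentropy(y_true, y_pred, eps=1e-7):
    y_pred = np.clip(y_pred, eps, 1.0 - eps)  # avoid log(0)
    return -np.mean(y_true * np.log(y_pred)
                    + (1.0 - y_true) * np.log(1.0 - y_pred))
\end{verbatim}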

There exist several methods to minimize the loss, the simplest being stochastic gradient descent (SGD). In SGD, we calculate the gradient of the loss with respect to the weights and biases and apply it directly to them. By doing this repeatedly, we eventually end up in a minimum\footnote{It is very possible to just get stuck in a local minimum}.\\
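Written out, a single SGD update step can be sketched as

\begin{equation}
w \leftarrow w - \eta \, \nabla_w L(w,b), \qquad b \leftarrow b - \eta \, \nabla_b L(w,b)
\end{equation}

where $\eta$ denotes the learning rate, a symbol introduced here only for illustration.\\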
Training algorithms working with momentum are essentially an improved version of SGD. To circumvent the problem of getting stuck in a local minimum, the gradient can build up momentum from past gradients. This is done by adding a momentum term to the changes applied to the weights and biases. The momentum is an exponentially decaying average over past gradients. This generally trains faster than plain SGD and is less likely to get stuck in local minima.\\
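Schematically, the momentum update keeps a velocity term $v$ in addition to the plain gradient; $v$ and the decay factor $\gamma$ are introduced here only for illustration:

\begin{align}
v \leftarrow \gamma v + \eta \, \nabla_w L(w,b)\\
w \leftarrow w - v
\end{align}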
Another commonly used modification of stochastic gradient descent is an adaptive learning rate, as implemented in the optimizer called RMSProp. This algorithm scales the learning rate of each individual parameter by an exponentially decaying average of the past squared gradients. The learning rate is adapted such that it is large if the past gradients were small, in order to increase the step size, and vice versa. The average of the past squared gradients is exponentially decaying, since otherwise the learning rate would become very small after only a few iterations.\\
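In formulas, RMSProp keeps a running average $s$ of the squared gradients and scales the learning rate by its inverse square root; the decay rate $\rho$ and the small constant $\epsilon$ (to avoid division by zero) are introduced here only for illustration:

\begin{align}
s \leftarrow \rho s + (1-\rho) \left(\nabla_w L(w,b)\right)^2\\
w \leftarrow w - \frac{\eta}{\sqrt{s}+\epsilon} \, \nabla_w L(w,b)
\end{align}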
The optimizer used in this thesis is called Adam, which stands for Adaptive Moment Estimation \cite{kingma2014adam}. Adam can be described as a combination of the momentum and RMSProp methods, since the estimates of the first and second moments of the past gradients are used to scale the learning rate of each individual parameter. The first moment is an exponentially decaying average of past gradients, as in momentum, and the second moment is an exponentially decaying average of past squared gradients, as in RMSProp.
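Schematically, and with the same illustrative notation as above, one Adam update step can be sketched as

\begin{align}
m \leftarrow \beta_1 m + (1-\beta_1) \, \nabla_w L(w,b)\\
v \leftarrow \beta_2 v + (1-\beta_2) \left(\nabla_w L(w,b)\right)^2\\
w \leftarrow w - \frac{\eta}{\sqrt{\hat{v}}+\epsilon} \, \hat{m}
\end{align}

where $m$ and $v$ are the first and second moment estimates, $\beta_1$ and $\beta_2$ their decay rates, and $\hat{m} = m/(1-\beta_1^t)$ and $\hat{v} = v/(1-\beta_2^t)$ are the bias-corrected moments after $t$ update steps. In Keras, this optimizer is selected simply by passing \texttt{optimizer='adam'} when compiling the model, as in the training sketch above.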