\section{Data}

\subsection{General information}
Two sets of data were used in this thesis. First, each of the datasets was shuffled to counteract any bias introduced by the ordering of the data and then split into two parts: $80\%$ was used to train the model (training set), while the remaining $20\%$ was later used to test the model (test set).\\
The sets were created using a Geant4-based simulation \cite{agostinelli2003s} with the specific detector configuration of the $\mu \rightarrow 3e$ experiment.\\
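The shuffle-and-split step described above can be sketched as follows. This is a minimal illustration with dummy data; the array names and the fixed seed are assumptions for the example, not the actual thesis code.

```python
import numpy as np

rng = np.random.default_rng(seed=42)  # fixed seed only for reproducibility of the example

def shuffle_and_split(tracks, train_fraction=0.8):
    """Shuffle the tracks to remove ordering bias, then split 80/20."""
    indices = rng.permutation(len(tracks))
    shuffled = tracks[indices]
    n_train = int(train_fraction * len(tracks))
    return shuffled[:n_train], shuffled[n_train:]  # (training set, test set)

# example: 100 dummy 8-hit tracks with (x, y, z) coordinates per hit
tracks = rng.normal(size=(100, 8, 3))
train_set, test_set = shuffle_and_split(tracks)
print(train_set.shape, test_set.shape)  # (80, 8, 3) (20, 8, 3)
```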

The first dataset (dataset 1) contained 46896 true 8-hit tracks of recurling particles, with each hit consisting of three coordinates $(x,y,z)$.\\

The second dataset (dataset 2) contained 109821 tracks. These were exclusively tracks that the current track reconstruction algorithm was not conclusively able to assign to an event. As a result, every event contained all the preselected tracks, computed by the existing algorithm, that were considered possible candidates. It is important to note that only for around $75\%$ of the events was the true track contained in this preselection. This posed an additional challenge, as one could not simply choose the best-fitting track. To assign the tracks to their corresponding events, each track carried an event number matching it with its event\footnote{One number for all tracks of the same event.}. Each track contained the coordinates of its 8 hits $(x,y,z)$, the value of the $\chi^2$-fit performed by the reconstruction algorithm, the event number, as well as a label indicating whether the track was true or false\footnote{Only used for training and testing of the system.}.
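The event-number bookkeeping can be illustrated as follows. This is a hedged sketch: the dictionary field layout and values are invented for illustration and do not reflect the actual file format of dataset 2.

```python
from collections import defaultdict

# hypothetical flat list of candidate tracks; each entry carries its event number
tracks = [
    {"event": 0, "chi2": 1.2, "label": 1},
    {"event": 0, "chi2": 3.4, "label": 0},
    {"event": 1, "chi2": 0.9, "label": 0},
]

# group all candidate tracks belonging to the same event via the shared event number
events = defaultdict(list)
for track in tracks:
    events[track["event"]].append(track)

print(len(events[0]))  # 2 candidate tracks for event 0
```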

\subsection{Preprocessing}

\subsubsection{Dataset 1}

To optimize the data fed into the RNN, dataset 1 was preprocessed. In a first step, a min-max scaler with a range of $[-0.9,0.9]$ from the Python library Scikit-learn \cite{pedregosa2011scikit} was used. This particular choice of range was based on the fact that a $\tanh$ activation function was used in the output layer. To accommodate its property of being asymptotically bounded by $\pm 1$, we chose a range of $[-0.9,0.9]$ to make all the data easily reachable by the system. In a second step, the data was shuffled and split into the training and test sets. The first four hits of each track served as input for the RNN, while the last four hits were our prediction target.
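The scaling and input/target split can be sketched with Scikit-learn's \texttt{MinMaxScaler}. This is a minimal sketch on dummy data; the variable names and the flattening strategy (fitting one scaler over all hits) are illustrative assumptions.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# dummy data: 1000 tracks, 8 hits per track, 3 coordinates per hit
tracks = np.random.normal(size=(1000, 8, 3))

# MinMaxScaler expects a 2-D array, so flatten the hit dimension first
scaler = MinMaxScaler(feature_range=(-0.9, 0.9))
flat = tracks.reshape(-1, 3)
scaled = scaler.fit_transform(flat).reshape(tracks.shape)

# first four hits form the RNN input, last four hits the prediction target
x_input, y_target = scaled[:, :4, :], scaled[:, 4:, :]
print(x_input.shape, y_target.shape)  # (1000, 4, 3) (1000, 4, 3)
```

Scaling into $[-0.9,0.9]$ instead of $[-1,1]$ keeps every target strictly inside the range of the $\tanh$ output activation, so the network never has to saturate to reach a target value.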

\subsubsection{Dataset 2}
\label{dataset2}

Analogously to dataset 1, the coordinates of the tracks, as well as the $\chi^2$ values, were first scaled to a range of $[-0.9,0.9]$ with min-max scalers (a separate one for each) from the Python library Scikit-learn. Then, the first four hits of every track were fed into our first track-predicting RNN. For each of the last four hits of a track we then had two sets of coordinates: the coordinates predicted by our RNN and the coordinates given by the reconstruction algorithm. To have the information of the $\chi^2$ fit available at each step, we created an array of shape $(\#\text{tracks}, \text{steps}, 4)$ (one dimension for each of the three coordinates and another for the $\chi^2$ fit). However, in place of the $x,y,z$ coordinates we stored neither the coordinates predicted by our RNN nor the coordinates given by the reconstruction algorithm, but the difference between the two. Our target was the truth value of each track\footnote{$1 =$ true, $0 =$ false}.
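The construction of this $(\#\text{tracks}, \text{steps}, 4)$ feature array can be sketched as follows. The arrays are filled with dummy values and the names are illustrative; only the shape logic and the coordinate-difference idea come from the text above.

```python
import numpy as np

n_tracks, n_steps = 500, 4  # the last four hits of each track
predicted = np.random.normal(size=(n_tracks, n_steps, 3))      # RNN-predicted coordinates
reconstructed = np.random.normal(size=(n_tracks, n_steps, 3))  # reconstruction-algorithm coordinates
chi2 = np.random.uniform(size=(n_tracks,))                     # one chi^2 value per track

features = np.empty((n_tracks, n_steps, 4))
features[:, :, :3] = predicted - reconstructed   # store the coordinate differences
features[:, :, 3] = chi2[:, np.newaxis]          # repeat the chi^2 value at every step

print(features.shape)  # (500, 4, 4)
```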