diff --git a/Report/00_main.log b/Report/00_main.log index 360ec50..dbc9d2f 100644 --- a/Report/00_main.log +++ b/Report/00_main.log @@ -1,4 +1,4 @@ -This is pdfTeX, Version 3.14159265-2.6-1.40.19 (MiKTeX 2.9.6730 64-bit) (preloaded format=pdflatex 2018.7.26) 4 AUG 2018 14:03 +This is pdfTeX, Version 3.14159265-2.6-1.40.19 (MiKTeX 2.9.6730 64-bit) (preloaded format=pdflatex 2018.7.26) 8 AUG 2018 19:51 entering extended mode **./00_main.tex (00_main.tex @@ -1696,16 +1696,6 @@ [] -Underfull \hbox (badness 10000) in paragraph at lines 124--125 - - [] - - -Underfull \hbox (badness 10000) in paragraph at lines 126--127 - - [] - - Underfull \hbox (badness 10000) in paragraph at lines 126--127 [] @@ -1906,32 +1896,32 @@ File: img/batch_norm.jpeg Graphic file (type jpg) -Package pdftex.def Info: img/batch_norm.jpeg used on input line 164. +Package pdftex.def Info: img/batch_norm.jpeg used on input line 163. (pdftex.def) Requested size: 390.0pt x 134.28722pt. [27 <./img/batch_norm.jpeg>] File: img/RNN_general_architecture.png Graphic file (type png) Package pdftex.def Info: img/RNN_general_architecture.png used on input line 1 -80. +79. (pdftex.def) Requested size: 390.0pt x 146.13263pt. -Underfull \hbox (badness 10000) in paragraph at lines 199--200 +Underfull \hbox (badness 10000) in paragraph at lines 198--199 [] [28 <./img/RNN_general_architecture.png>] -Underfull \hbox (badness 10000) in paragraph at lines 201--204 +Underfull \hbox (badness 10000) in paragraph at lines 200--203 [] -Underfull \hbox (badness 10000) in paragraph at lines 209--210 +Underfull \hbox (badness 10000) in paragraph at lines 208--209 [] -Underfull \hbox (badness 10000) in paragraph at lines 211--212 +Underfull \hbox (badness 10000) in paragraph at lines 210--211 [] @@ -1939,10 +1929,10 @@ File: img/LSTM_cell.png Graphic file (type png) -Package pdftex.def Info: img/LSTM_cell.png used on input line 217. +Package pdftex.def Info: img/LSTM_cell.png used on input line 216. (pdftex.def) Requested size: 312.00119pt x 186.04034pt. -Underfull \hbox (badness 10000) in paragraph at lines 237--238 +Underfull \hbox (badness 10000) in paragraph at lines 236--237 [] @@ -2094,7 +2084,7 @@ Package atveryend Info: Executing hook `AtVeryEndDocument' on input line 72. Package atveryend Info: Executing hook `AtEndAfterFileList' on input line 72. Package rerunfilecheck Info: File `00_main.out' has not changed. -(rerunfilecheck) Checksum: F467950E883FD22A1766479B9392E7BD;3897. +(rerunfilecheck) Checksum: 4FF9D8AF4C0B1885A24E06FE28964CD8;3898. LaTeX Warning: There were multiply-defined labels. @@ -2108,7 +2098,7 @@ 23449 multiletter control sequences out of 15000+200000 548944 words of font info for 87 fonts, out of 3000000 for 9000 1141 hyphenation exceptions out of 8191 - 47i,19n,65p,1103b,569s stack positions out of 5000i,500n,10000p,200000b,50000s + 47i,19n,65p,1105b,569s stack positions out of 5000i,500n,10000p,200000b,50000s pdfTeX warning (dest): name{Hfootnote.29} has been referenced but does not ex ist, replaced by a fixed one @@ -2213,7 +2203,7 @@ onts/cm/cmsy6.pfb> -Output written on 00_main.pdf (47 pages, 1706673 bytes). +Output written on 00_main.pdf (47 pages, 1706776 bytes). PDF statistics: 820 PDF objects out of 1000 (max. 8388607) 190 named destinations out of 1000 (max. 
500000) diff --git a/Report/00_main.out b/Report/00_main.out index 25c4530..4780301 100644 --- a/Report/00_main.out +++ b/Report/00_main.out @@ -45,7 +45,7 @@ \BOOKMARK [1][-]{section.7}{RNN's used}{}% 45 \BOOKMARK [2][-]{subsection.7.1}{RNN for track prediction}{section.7}% 46 \BOOKMARK [2][-]{subsection.7.2}{RNN for classification of tracks}{section.7}% 47 -\BOOKMARK [1][-]{section.8}{Results}{}% 48 +\BOOKMARK [1][-]{section.8}{Analysis}{}% 48 \BOOKMARK [2][-]{subsection.8.1}{Best 2}{section.8}% 49 \BOOKMARK [2][-]{subsection.8.2}{RNN classifier with RNN track prediction input}{section.8}% 50 \BOOKMARK [2][-]{subsection.8.3}{XGBoost}{section.8}% 51 diff --git a/Report/00_main.pdf b/Report/00_main.pdf index 18299df..9077566 100644 --- a/Report/00_main.pdf +++ b/Report/00_main.pdf Binary files differ diff --git a/Report/00_main.synctex.gz b/Report/00_main.synctex.gz index bda1fa7..7ecca3b 100644 --- a/Report/00_main.synctex.gz +++ b/Report/00_main.synctex.gz Binary files differ diff --git a/Report/00_main.toc b/Report/00_main.toc index f4d58e5..8dbe0c0 100644 --- a/Report/00_main.toc +++ b/Report/00_main.toc @@ -46,7 +46,7 @@ \contentsline {section}{\numberline {7}RNN's used}{34}{section.7} \contentsline {subsection}{\numberline {7.1}RNN for track prediction}{34}{subsection.7.1} \contentsline {subsection}{\numberline {7.2}RNN for classification of tracks}{35}{subsection.7.2} -\contentsline {section}{\numberline {8}Results}{38}{section.8} +\contentsline {section}{\numberline {8}Analysis}{38}{section.8} \contentsline {subsection}{\numberline {8.1}Best $\chi ^2$}{38}{subsection.8.1} \contentsline {subsection}{\numberline {8.2}RNN classifier with RNN track prediction input}{38}{subsection.8.2} \contentsline {subsection}{\numberline {8.3}XGBoost}{40}{subsection.8.3} diff --git a/Report/01_Standard_Model.tex b/Report/01_Standard_Model.tex index dd21d04..e022266 100644 --- a/Report/01_Standard_Model.tex +++ b/Report/01_Standard_Model.tex @@ -3,8 +3,8 @@ \label{intro_elem_part} The Standard Model(SM) describes all known elementary particles as well as three of the four known forces\footnote{Strong, weak and electromagnetic forces}.\\ -The elementary particles that make up matter can be split into two categories, namely quarks and leptons. There are 6 types of quarks and six types of leptons. The type of a particle is conventionally called flavour. The six quark flavours and the three lepton flavours are separated over 3 generations (each which two quarks and two leptons in it). -Experimental evidence suggests that there exist exactly three generations of particles \cite{akrawy1989measurement}. Each particle of the first generation has higher energy versions of itself with the similar properties, besides their mass, (e.g. 
$e^- \rightarrow \mu^- \rightarrow \tau^-$) as in other generations. For each following generation, the particles have a higher mass than the generation before. +The elementary particles that make up matter can be split into two categories, namely quarks and leptons. There are six types of quarks and six types of leptons. The type of a particle is conventionally called flavour. The six quark flavours and the six lepton flavours are separated over 3 generations (each with two quarks and two leptons in it). Experimental evidence suggests that exactly three generations of particles exist \cite{akrawy1989measurement}. Each particle of the first generation has higher-energy versions of itself with similar properties, besides their mass (e.g. $e^- \rightarrow \mu^- \rightarrow \tau^-$), in the other generations. For each following generation, the particles have a higher mass than the generation before. \begin{table}[H] \begin{center} @@ -23,7 +23,7 @@ \end{table} One category consists of quarks($q$)(see Table \ref{Quark_SM_table}). Among these, we differentiate between up-type quarks, with charge $\frac{2}{3}e$, and down-type quarks, with charge $-\frac{1}{3}e$. Quarks interact with all fundamental forces.\\ -Each quark carries a property called colour-charge. The possible colour charges are red(r), green(gr), blue(bl) in which anti-quarks carry anti-colour. Quarks can only carry one colour, whilst every free particle has to be colourless\footnote{Colour confinement}. In conclusion we cannot observe a single quark.\\ +Each quark carries a property called colour-charge. The possible colour charges are red(r), green(gr), blue(bl), while anti-quarks carry anti-colour. Quarks can only carry one colour, whilst every free particle has to be colourless\footnote{Colour confinement}. Consequently, we cannot observe a single quark, as quarks always appear in groups of two or three to achieve colourlessness.\\ Free particles can achieve being colourless in two ways. Either by having all three colours present in the same amount (one quark of each colour), which creates the characteristic group of baryons($qqq$) and anti-baryons($\bar{q}\bar{q}\bar{q}$) or by having a colour and its anticolour present, which creates the group of mesons($q\bar{q}$). \begin{table}[H] \begin{center} @@ -45,7 +45,7 @@ \end{table} The other group consists of leptons(l)(see Table \ref{Lepton_SM_table}). They only interact through the weak and the electromagnetic force. Each generation consists of a lepton of charge -1 and a corresponding EM neutrally charged neutrino. The electron has the lowest energy of all charged leptons. This makes the electron stable while the higher generation particles decay to lower energy particles.\\ -The leptons of one generation, namely the charged lepton and its corresponding neutrino are called a lepton family. A lepton of a family counts as 1 to its corresponding lepton family number whilst a anti-lepton counts as -1. +The leptons of one generation, namely the charged lepton and its corresponding neutrino, are called a lepton family. A lepton of a family counts as 1 to its corresponding lepton family number whilst an anti-lepton counts as -1. \begin{table}[H] \begin{center} @@ -61,18 +61,18 @@ \end{table} -The particles of the SM interact through the 3 fundamental forces of the SM. In these interactions, particles called bosons are being exchanged which are the carriers of their respective force (see Table \ref{fund_forces_table}).\\ +The particles of the SM interact through the three fundamental forces of the SM. In these interactions, particles called bosons are being exchanged, which are the carriers of their respective force (see Table \ref{fund_forces_table}).\\ As mentioned above, only quarks can interact through the strong force, in which they exchange gluons. Gluons are massless and EM neutrally charged. The strong force has the biggest coupling strength of 1 (though it decreases with higher energies as a result of gluon-gluon self interaction loops, which interfere negatively in perturbation theory)\cite{thomson2013modern}. A gluon carries colour charge and hence can change the colour of a quark but it conserves its flavour. The strong interaction has an underlying gauge symmetry of SU(3). 
Therefore, it can be derived that colour charge is conserved through the strong interaction\footnote{E.g. through Gell-Mann matrices}.\\ -The electromagnetic(EM) force is propagated through the photon. It carries zero charge and no invariant mass. Exclusively charged particles can interact through the electromagnetic force. The coupling strength is $\alpha \approx \frac{1}{137}$, contrary to the strong force the coupling constant increases with higher energies\cite{thomson2013modern}. This difference stems from the fact that photon-photon interaction loops are not allowed whereas gluon-gluon interaction loops are. In perturbation theory this results in only positive terms being added to the coupling strength. The underlying gauge symmetry is of SU(1). The electromagnetic force also conserves flavour.\\ +The electromagnetic(EM) force is propagated through the photon. It carries zero charge and no invariant mass. Only charged particles can interact through the electromagnetic force. The coupling strength is $\alpha \approx \frac{1}{137}$. Contrary to the strong force, the coupling constant increases with higher energies\cite{thomson2013modern}. This difference stems from the fact that photon-photon interaction loops are not allowed whereas gluon-gluon interaction loops are. In perturbation theory this results in only positive terms being added to the coupling strength. The underlying gauge symmetry is U(1). The electromagnetic force also conserves flavour.\\ The weak force has two types of bosons. The bosons of the weak force are the only fundamental bosons to have an inertial mass.\\ -First we will discuss the EM neutral Z boson. Even though the Z boson belongs to the weak force it, it also has an electromagnetic part additionally to the weak force part\footnote{$Z \rightarrow EM_{part} + W^3$, \cite{thomson2013modern}}. It follows directly, that the Z boson couples weaker to uncharged particles.\\ -The other boson of the weak force is the W boson. In the classical SM, the only way particles can change flavour is through the weak force by emitting or absorbing W boson. It is important to notice that, besides of having an invariant mass, the W boson is the only boson with a non zero charge ($Q_{W^\pm} = \pm 1e$). In the gauge symmetry of the weak force the $W^\pm$ are actually the creation and annihilation operators of said symmetry\footnote{$W^\pm = W_1 \pm i W_2$}.\\ -An important characteristic of the weak force is that it exclusively couples to lefthanded(LH) particles and righthanded(RH) antiparticles (describing chirality states)\footnote{In the ultrarelativistic limit helicity and chirality eigenstates are the same}.\\ +First we will discuss the EM neutral Z boson. Even though the Z boson belongs to the weak force, it also has an electromagnetic part in addition to the weak force part\footnote{$Z \rightarrow EM_{part} + W^3$, \cite{thomson2013modern}}. It follows directly that the Z boson couples more weakly to uncharged particles.\\ +The other boson of the weak force is the W boson. In the classical SM, the only way particles can change flavour is through the weak force by emitting or absorbing a W boson. It is important to note that, besides having an invariant mass, the W boson is the only boson with a non-zero charge ($Q_{W^\pm} = \pm 1e$). 
In the gauge symmetry of the weak force the $W^\pm$ are actually the creation and annihilation operators of said symmetry\footnote{$W^\pm = W_1 \pm i W_2$}.\\ +An important characteristic of the weak force is that it exclusively couples to lefthanded(LH) particles and righthanded(RH) antiparticles (describing chirality states)\footnote{In the ultrarelativistic limit helicity and chirality eigenstates are the same}.\\ The chirality operators for left- and righthandedness are: \\ LH: $\frac{1}{2}(1-\gamma^5)$, RH: $\frac{1}{2}(1+\gamma^5)$\\ -As a consequence RH particles and LH anti-particles can't couple to the W boson at all. This also results in charged RH particles and LH anti-particles to couple to the Z boson only through the electromagnetic part of the itself, while uncharged RH particles and LH anti particles (e.g. RH $\nu$, LH $\bar{\nu}$) don't couple with the EM force nor the weak force. +As a consequence, RH particles and LH anti-particles cannot couple to the W boson at all. This also results in charged RH particles and LH anti-particles coupling to the Z boson only through the electromagnetic part of the Z boson, while uncharged RH particles and LH anti-particles (e.g. RH $\nu$, LH $\bar{\nu}$) couple to neither the EM force nor the weak force. \subsection{Interaction rules} @@ -80,7 +80,7 @@ Now, we will establish the general rules for interactions in the SM.\\ \textbf{Baryon number is conserved}\\ -As we already established before, the only interaction that can change flavour is the weak force through the W boson. We directly see that all other interactions baryon number has to be conserved. So any up-type quark can be changed to a down-type quark and backwards by emitting or absorbing a W boson. In the end however, there are still 3 quarks which form a baryon\footnote{Pentaquarks($qqqq\bar{q}$) and other exotic states excluded}, even though it changed its type and charge. A well known example is the beta decay, where a down quark in a neutron decays into an up quark to form now a proton(e.g. see Figure \ref{beta-decay_feynman}). We easily see that the baryon number is conserved.\\\\ +As we already established before, the only interaction that can change flavour is the weak force through the W boson. We directly see that in all other interactions baryon number has to be conserved. So any up-type quark can be changed into a down-type quark and vice versa by emitting or absorbing a W boson. In the end however, there are still 3 quarks which form a baryon\footnote{Pentaquarks($qqqq\bar{q}$) and other exotic states excluded}, even though one of them changed its type and charge. A well known example is the beta decay, where a down quark in a neutron decays into an up quark, turning the neutron into a proton (e.g. see Figure \ref{beta-decay_feynman}). We easily see that the baryon number is conserved.\\\\ \begin{figure}[H] \begin{center} @@ -100,7 +100,7 @@ \textbf{Lepton family number is conserved}\\ According to the SM lepton family number is conserved. As all interactions beside the W conserve particle flavour, it is easy to see that lepton family number is conserved.\\ -Whenever a lepton interacts with a W boson, it just changes a lepton to its corresponding lepton neutrino and or the other way around (e.g. see Figure \ref{muon-decay_feynman}).\newpage +Whenever a lepton interacts with a W boson, the lepton simply changes into its corresponding neutrino or the other way around (e.g. 
see Figure \ref{muon-decay_feynman}).\newpage \section{Physics beyond the SM} @@ -121,13 +121,13 @@ \label{PMNS_neutrino} \end{equation} -As a result, neutrinos propagate as a superposition of all mass eigenstates. Additionally, we can describe the PMNS matrix through three mixing angles $\theta_{12}$, $\theta_{13}$ and $\theta_{23}$ and a complex phase $\delta$ \footnote{Measurements: $\theta_{12} \approx 35^\circ$, $\theta_{13} \approx 10^\circ$, $\theta_{23} \approx 45^\circ$ \cite{abe2008precision}, \cite{adamson2011measurement}}. The electron superposition looks then like this:\\\\ +As a result, neutrinos propagate as a superposition of all mass eigenstates. Additionally, we can describe the PMNS matrix through three mixing angles $\theta_{12}$, $\theta_{13}$ and $\theta_{23}$ and a complex phase $\delta$ \footnote{Measurements: $\theta_{12} \approx 35^\circ$, $\theta_{13} \approx 10^\circ$, $\theta_{23} \approx 45^\circ$ \cite{abe2008precision}, \cite{adamson2011measurement}}. Using this, the electron neutrino superposition looks like this:\\ -$\ket{\nu_e} = U_{e_1} \ket{\nu_1} e^{{-i \Phi_1}} + U_{e_2} \ket{\nu_2} e^{{-i \Phi_2}} + U_{e_3} \ket{\nu_3} e^{{-i \Phi_3}}$ with $\Phi_i = E_i \times t$\\\\ +$\ket{\nu_e} = U_{e_1} \ket{\nu_1} e^{{-i \Phi_1}} + U_{e_2} \ket{\nu_2} e^{{-i \Phi_2}} + U_{e_3} \ket{\nu_3} e^{{-i \Phi_3}}$ with $\Phi_i = E_i \times t$\\ As a result, lepton family number is not a conserved quantity anymore, as neutrino flavour oscillates over time.\\ -We can calculate the probability for a neutrino to transition from flavour $\alpha$ to $\beta$ like: +We can calculate the probability for a neutrino to transition from flavour $\alpha$ to $\beta$ like this: \begin{equation} \begin{split} @@ -151,19 +151,19 @@ \end{figure} Nowadays it's a well accepted fact that lepton family number gets violated through neutrino oscillation.\\ -But why should flavour oscillation be exclusive to neutrinos?\\ +However, why should flavour oscillation be exclusive to neutrinos?\\ Maybe there are also ways for the EM charged leptons to directly transition to another lepton family\footnote{Maybe also possible for quarks?}? \subsection{New physics} As a consequence of neutrino oscillation, lepton flavour is a broken symmetry. The SM has to be adapted to include lepton flavour violation (LFV) and massive neutrinos. LFV is also expected for charged leptons.\\ -Although, it has yet to be determined how LFV violation exactly works to which scale it exists.\\ +However, it has yet to be determined how exactly LFV works and at which scale it exists.\\ This may raise the question of why charged LFV has never been observed yet. This is especially surprising as the mixing angles of the neutrinos have been measured to be big.\\ There are two reasons why charged LFV is strongly suppressed: -The first is that charged leptons are much heavier than neutrinos and the other that the mass differences between neutrino flavour are tiny compared to the W boson mass.\\ +The first is that charged leptons are much heavier than neutrinos, and the other that the mass differences between the neutrino flavours are tiny compared to the W boson mass.\\ -In the classical SM, charged LFV is already forbidden at tree level. Though it can be induced indirectly through higher order loop diagrams (using neutrino oscillation). By adding new particles beyond the SM, we generate new ways for LFV in the charged sector to happen. 
As LFV is naturally generated in many models beyond the SM, finding charged LFV is a strong hint for new physics. +In the classical SM, charged LFV is already forbidden at tree level. It can, however, be induced indirectly through higher order loop diagrams using neutrino oscillation. By adding new particles beyond the SM, we generate new ways for LFV in the charged sector to happen. As LFV is naturally generated in many models beyond the SM, finding charged LFV is a strong hint for new physics. \begin{figure}[H] \begin{center} @@ -186,7 +186,7 @@ \end{center} \end{figure} -One way charged LFV can occur is through supersymmetric particles (see Figure \ref{LFV-SUSY}). By observing charged LFV supersymmetry would gain new importance.\\ -Together with supersymmetric models, other extensions of the SM such as left-right symmetric models, grand unified models, models with an extended Higgs sector and models where electroweak symmetry is broken dynamically are all good candidates to explain charged LFV and most importantly experimentally accessible in a large region of the parameter space. +One way charged LFV can occur is through supersymmetric particles (see Figure \ref{LFV-SUSY}). An observation of charged LFV would therefore give supersymmetry new importance.\\ +Alongside supersymmetric models, other extensions of the SM, such as left-right symmetric models, grand unified models, models with an extended Higgs sector and models where electroweak symmetry is broken dynamically, are all good candidates to explain charged LFV and, most importantly, are experimentally accessible in a large region of the parameter space. diff --git a/Report/02_mu_to_3e_decay.tex b/Report/02_mu_to_3e_decay.tex index 6620d10..bec86b0 100644 --- a/Report/02_mu_to_3e_decay.tex +++ b/Report/02_mu_to_3e_decay.tex @@ -7,7 +7,7 @@ Possible ways for the decay $\mu \rightarrow eee$ to occur are shown in Figures \ref{LFV-neutrino_osc}, \ref{LFV-SUSY}, \ref{LFV-tree_lvl}.\\ -Still some simplifications are made as it is assumed that only the tree and the photon diagram are relevant. \cite{blondel2013research}\\ +Still, some simplifications are made, as it is assumed that only the tree and the photon diagrams are relevant \cite{blondel2013research}.\\ This gives us a Lagrangian of: @@ -15,7 +15,7 @@ L_{LFV} = \left[\frac{m_\mu}{(\kappa+1)\Lambda^2}\overline{\mu_R}\sigma^{\mu\nu}e_LF_{\mu\nu}\right]_{\gamma-\text{penguin}}+\left[\frac{\kappa}{(\kappa+1)\Lambda^2}(\overline{\mu_L}\gamma^{\mu}e_L)(\overline{e_L}\gamma_\mu e_L)\right]_{\text{tree}} \end{equation} -If we neglect signal and background we can use momentum conservation as the decay happens rather quickly. As a result the total sum of all particle momenta should be equal to zero: +If we neglect signal and background, we can use momentum conservation as the decay happens rather quickly. As a result, the total sum of all particle momenta should be equal to zero: \begin{equation} \left\vert \vec{p}_{tot} \right\vert = \left\vert\sum \vec{p}_i \right\vert = 0 \end{equation}
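As an illustration of how such a kinematic check could look in practice, here is a minimal numpy sketch (the momenta, the tolerance and the variable names are illustrative assumptions, not values or code from the experiment):
\begin{verbatim}
import numpy as np

# Illustrative reconstructed momenta (MeV/c) of the three electron
# tracks of one candidate event, one row per track: (px, py, pz).
tracks = np.array([[ 20.0, -35.0, 10.0],
                   [-45.0,  12.0, -8.0],
                   [ 25.0,  23.0, -2.0]])

M_MU = 105.66  # muon rest energy in MeV
M_E = 0.511    # electron rest energy in MeV
TOL = 1.0      # assumed resolution-dependent tolerance in MeV

# |p_tot| = |sum of the p_i| should vanish for a signal candidate ...
p_tot = np.linalg.norm(tracks.sum(axis=0))
# ... and the summed energies should equal the muon rest energy,
# otherwise invisible neutrinos carried away energy and momentum.
e_tot = np.sum(np.sqrt((tracks**2).sum(axis=1) + M_E**2))

is_signal_like = (p_tot < TOL) and (abs(e_tot - M_MU) < TOL)
print(p_tot, e_tot, is_signal_like)
\end{verbatim}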
@@ -28,28 +28,28 @@ Below is a summary of all the different types of background considered in the experiment. \subsubsection{Internal conversions} -The event $\mu \rightarrow eee\nu\nu$ results in the same particles seen by the detector as the event we are searching for\footnote{Neutrinos are invisible to our detector}. As a result it proves to be quite challenging to separate the two.\\ +The event $\mu \rightarrow eee\nu\nu$ results in the same particles seen by the detector as the event we are searching for\footnote{Neutrinos are invisible to our detector}. As a result, it proves to be quite challenging to separate the two.\\ By using momentum conservation, it becomes possible to differentiate the $\mu \rightarrow eee$ and the $\mu \rightarrow eee\nu\nu$ events. In the muon rest frame the total momentum is zero and the energy of the resulting particles is equal to the muon rest energy.\\ -By reconstructing the energy and momenta of the three $e$ we can check if their momenta add up to zero and their energies equal the muon rest energy. If not we can assume that there are additional neutrinos. This differentiation between the two events is crucial for the experiment as the $\mu \rightarrow eee\nu\nu$ events pose the most serious background for $\mu \rightarrow eee$ decay measurements.\\ -As a result, our detector needs a very good energy resolution to consistently make it possible to differentiate between the two events as neutrino energies and momenta are very small. +By reconstructing the energy and momenta of the three $e$ we can check whether their momenta add up to zero and their energies equal the muon rest energy. If not, we can assume that there are additional neutrinos. This differentiation between the two events is crucial for the experiment as the $\mu \rightarrow eee\nu\nu$ events pose the most serious background for $\mu \rightarrow eee$ decay measurements.\\ +As a result, our detector needs a very good energy resolution to consistently make it possible to differentiate between the two events, as neutrino energies and momenta are very small. \subsubsection{Michel decay} -The biggest contributing background however stems from another decay called Michel decay, that is also allowed in the classical SM. As we use a beam of positive muons the corresponding Michel decay looks as follows: $\mu^+ \rightarrow e^+ \nu\bar{\nu}$.\\ -Contrary to the events before this one does not produce any em negatively charged particles. This makes these events easily distinguishable from our wanted events. As a result they only enter our data in form of a potential background through wrongly constructed tracks. +The biggest contributing background, however, stems from another decay, called Michel decay, which is also allowed in the classical SM. As we use a beam of positive muons, the corresponding Michel decay looks as follows: $\mu^+ \rightarrow e^+ \nu\bar{\nu}$.\\ +Contrary to the events before, this one does not produce any EM negatively charged particles. This makes these events easily distinguishable from our wanted events. Therefore, they only enter our data in the form of a potential background through wrongly constructed tracks. \subsubsection{Radiative muon decay} -This is the case where $\mu \rightarrow e^+\gamma\nu\nu$. If the photon produced in this event has high enough energies and creates a matter antimatter pair in the target region ($\gamma \rightarrow e^-e^+$), it can create a similar signature than the searched event. They contribute to the accidental background, as equal to the searched event no neutrinos are produced. To minimize these effects, the material in both the target and detector is minimized and a vertex constraint is applied. +This is the case where $\mu \rightarrow e^+\gamma\nu\nu$. 
If the photon produced in this event has high enough energies and creates a matter antimatter pair in the target region ($\gamma \rightarrow e^-e^+$), it can create a similar signature to the searched event. They contribute to the accidental background, as, just like in the searched event, no neutrinos are detected. To minimize these effects, the material in both the target and detector is minimized and a vertex constraint is applied. \subsubsection{BhaBha scattering} -Another way how background can get produced is when positrons from muon decays or the beam itself scatter with electrons in the target material. Consequently they share a common vertex and together with an ordinary muon decay it can look similar as our searched $\mu \rightarrow eee$ event. This contributes to the accidental background. +Another way background can be produced is when positrons from muon decays or the beam itself scatter off electrons in the target material. Consequently, they share a common vertex and, together with an ordinary muon decay, it can look similar to our searched $\mu \rightarrow eee$ event. This contributes to the accidental background. \subsubsection{Pion decays} -Certain pion decays also lead to indistinguishable signature as our searched event, the most prominent being the $\pi \rightarrow eee\nu$ and $\pi \rightarrow \mu\gamma\nu$ decays. The later only produces a similar signature if produced photon converts through pair production to an electron and a positron.\\ -However, as only a negligible portion will actually contribute to the background, as there is only a small branching fraction and the momenta and energy of the produced particles have to match up with the criteria mentioned in section \ref{Kinematics}. +Certain pion decays also lead to signatures indistinguishable from our searched event, the most prominent being the $\pi \rightarrow eee\nu$ and $\pi \rightarrow \mu\gamma\nu$ decays. The latter only produces a similar signature if the produced photon converts through pair production to an electron and a positron.\\ +However, only a negligible portion will actually contribute to the background, as there is only a small branching fraction and the momenta and energy of the produced particles have to match up with the criteria mentioned in section \ref{Kinematics}. \subsubsection{Analysis of the background} diff --git a/Report/03_experimental_setup.tex b/Report/03_experimental_setup.tex index 5d3569e..9bd8871 100644 --- a/Report/03_experimental_setup.tex +++ b/Report/03_experimental_setup.tex @@ -3,19 +3,19 @@ \subsection{Requirements} The ultimate goal of this experiment is to observe a $\mu \rightarrow eee$ event. As we strive for a sensitivity of $10^{-16}$, we should be able to observe this process if its branching ratio were higher than our sensitivity. Otherwise, we want to exclude a branching ratio $>10^{-16}$ with a $90\%$ certainty.\\ -To get to this sensitivity, more than $5.5 \cdot 10^{16}$ muon decays have to be observed. To reach this goal within one year, a muon stopping rate of $2 \cdot 10^9 Hz$ in combination with a high geometrical acceptance as well as a high efficiency of the experiment is required. +To get to this sensitivity, more than $5.5 \cdot 10^{16}$ muon decays have to be observed. In order to reach this goal within one year, a muon stopping rate of $2 \cdot 10^9 Hz$ in combination with a high geometrical acceptance, as well as a high efficiency of the experiment, is required. 
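As a rough sanity check of these numbers, assuming continuous running: one year corresponds to roughly $3.15 \cdot 10^7 s$, so a stopping rate of $2 \cdot 10^9 Hz$ yields $2 \cdot 10^9 Hz \cdot 3.15 \cdot 10^7 s \approx 6.3 \cdot 10^{16}$ stopped muons per year, which leaves headroom above the required $5.5 \cdot 10^{16}$ observed decays to absorb acceptance and efficiency losses.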
\subsection{Phase I} -Phase I of the experiment serves as an exploratory phase to gain more experience with the new technology and validate the experimental concept. At the same time it already strives to produce competitive measurements with a sensitivity of $10^{-15}$. \footnote{Current experiments are in the $10^{-12}$ sensitivity range} This will be done, by making use of the already existing muon beams at PSI with around $1$-$1.5\cdot10^{8}Hz$ of muons on target. The lowered sensitivity also allows for some cross-checks as the restrictions on the system are much more relaxed than in phase II. +Phase I of the experiment serves as an exploratory phase to gain more experience with the new technology and validate the experimental concept. At the same time, it already strives to produce competitive measurements with a sensitivity of $10^{-15}$\footnote{Current experiments are in the $10^{-12}$ sensitivity range}. This is achieved by making use of the already existing muon beams at PSI with around $1$-$1.5\cdot10^{8}Hz$ of muons on target. The lowered sensitivity also allows for some cross-checks, as the restrictions on the system are much more relaxed than in phase II. \subsection{Phase II} -Phase II strives to reach the maximum sensitivity of $10^{-16}$. To achieve this in a reasonable timeframe, a new beamline will be used which delivers more than $2\cdot10^{9}Hz$ of muons. +Phase II strives to reach the maximum sensitivity of $10^{-16}$. To achieve this in a reasonable timeframe, a new beamline will be used, which delivers more than $2\cdot10^{9}Hz$ of muons. \subsection{Experimental setup} \label{exp_setup} -The detector is of cylindrical shape around the beam. It has a total length of around $2m$ and is situated inside a $1T$ solenoid magnet with $1m$ of inner radius and a total length of $2.5m$. This form was chosen to cover as much phase space as possible. For an unknown decay such $\mu \rightarrow eee$, it crucial to have a high order of acceptance in all regions of phase space. There are only two kind of tracks that get lost. The first one are up- and downstream tracks and the second one are low transverse momenta tracks (no transversing of enough detector planes to be reconstructed). +The detector is of cylindrical shape around the beam. It has a total length of around $2m$ and is situated inside a $1T$ solenoid magnet with $1m$ of inner radius and a total length of $2.5m$. This form was chosen to cover as much phase space as possible. For an unknown decay such as $\mu \rightarrow eee$, it is crucial to have a high order of acceptance in all regions of phase space. There are only two kinds of tracks that get lost. The first are up- and downstream tracks and the second are low transverse momentum tracks (which do not traverse enough detector planes to be reconstructed). \begin{figure}[H] \begin{center} @@ -48,22 +48,22 @@ \end{center} \end{figure}\newpage -As seen in figure \ref{setup_II}, the final version of the detector can be divided into 5 separate parts in the longitudinal direction. There is the central part with the target, two inner silicon pixel layers, a fibre tracker and two outer silicon layers. The forward and backward parts, called recurl stations, consist only of a tile timing detector surrounded by two silicon recurl layers. 
A big advantage of this layout is that even a partially constructed detector (gradually over phase I to phase II parts get added) can give us competitive measurements.\\ +As seen in figure \ref{setup_II}, the final version of the detector can be divided into 5 separate parts in the longitudinal direction. There is the central part with the target, two inner silicon pixel layers, a fibre tracker and two outer silicon layers. The front and back parts, called recurl stations, consist only of a tile timing detector surrounded by two silicon recurl layers. A big advantage of this layout is that even a partially constructed detector (parts get added gradually from phase I to phase II) can give us competitive measurements.\\ The target itself is a large-surfaced double cone with a surface length of $10cm$ and a width of $2cm$. The target was chosen specifically to be of this shape to facilitate separating tracks coming from different muons and hereby also helping to reduce accidental background.\\ -The two inner detector layers, also called vertex layers, span a length $12cm$. The innermost layer consists of 12 tiles while the outer vertex layer consists of 18 tiles. The tiles are each of $1cm$ width, with the inner layer having an average radius of $1.9cm$, respectively $2.9cm$, and a pixel size of $80 \cross 80 \mu m^2$. \cite{augustin2017mupix}, \cite{philipp2015hv}, \cite{augustin2015mupix}. They are supported by two half cylinder made up of $25\mu m$ thin Kapton foil mounted on plastic. The detector layers itself are $50\mu m$ thin and cooled by gaseous helium. The vertex detectors are read out at a rate of $20MHz$, giving us a time resolution of $20ns$.\\ -After the vertex layers the particles pass through the fibre tracker (see Figure \ref{tracks_Ib,_II}, \ref{setup_II}). It is positioned around $6cm$ away from the center. Its main job is to provide accurate timing information for the outgoing electrons and positrons. It consists of three to five layers, each consisting of $36cm$ long and $250\mu m$ thick scintillating fibres with fast silicon photomultipliers at the end. They provide us a timing information of less than a $1ns$.\\ +The two inner detector layers, also called vertex layers, span a length of $12cm$. The innermost layer consists of 12 tiles while the outer vertex layer consists of 18 tiles. The tiles are each of $1cm$ width, with the inner and outer layers having average radii of $1.9cm$ and $2.9cm$, respectively, and a pixel size of $80 \cross 80 \mu m^2$ \cite{augustin2017mupix}, \cite{philipp2015hv}, \cite{augustin2015mupix}. They are supported by two half-cylinders made of $25\mu m$ thin Kapton foil mounted on plastic. The detector layers themselves are $50\mu m$ thin and cooled by gaseous helium. The vertex detectors are read out at a rate of $20MHz$, giving us a time resolution of $20ns$.\\ +After the vertex layers, the particles pass through the fibre tracker (see Figure \ref{tracks_Ib,_II}, \ref{setup_II}). It is positioned around $6cm$ away from the center. Its main job is to provide accurate timing information for the outgoing electrons and positrons. It consists of three to five layers, each consisting of $36cm$ long and $250\mu m$ thick scintillating fibres with fast silicon photomultipliers at the end. They provide timing information with a resolution of less than $1ns$.\\ Next, the outgoing particles encounter the outer silicon pixel detectors. They are mounted just after the fibre detector with average radii of $7.6cm$ and $8.9cm$. 
The inner layer has 24 and the outer has 28 tiles of $1cm$ length. The active area itself has a length of $36cm$. Similarly to the vertex detectors, they are mounted on $25\mu m$ thin Kapton foil with plastic ends.\\ -The stations beam up- and downwards only consist of the outer pixel detector layers as well as a timing detector. While the silicon detector are the same as in the central station, the timing tracker was chosen to be much thicker than the fibre detector in the central station. It consists of scintillating tiles with dimensions of $7.5 \cross 7.5 \cross 5 mm^3$. They provide an even better time resolution than the fibre tracker in the center. Incoming particles are supposed to be stopped here. The outer stations are mainly used to determine the momenta of the outgoing particles and have an active length of $36cm$ and a radius of around $6cm$. +The stations upstream and downstream of the beam only consist of the outer pixel detector layers, as well as a timing detector. While the silicon detectors are the same as in the central station, the timing tracker was chosen to be much thicker than the fibre detector in the central station. It consists of scintillating tiles with dimensions of $7.5 \cross 7.5 \cross 5 mm^3$. They provide an even better time resolution than the fibre tracker in the center. Incoming particles are supposed to be stopped here. The outer stations are mainly used to determine the momenta of the outgoing particles and have an active length of $36cm$ and a radius of around $6cm$. \subsection{The problem of low longitudinal momentum recurlers} -As explained in section \ref{exp_setup}, the outgoing particles are supposed to recurl back into the outer stations of the detector to enable a precise measurement of the momentum. A problem arises if the particles have almost no momentum in the beam direction. Then they can recurl back into the central station and cause additional hits there. As the the central station is designed to let particles easily pass through, they can recurl inside the central station many more times without getting stopped. As we have a $20ns$ time window for the readout of the pixel detectors, we need a very reliable way to identify and reconstruct these tracks as recurling particles as otherwise they look exactly like newly produced particles coming from our target. As one can imagine, this influences the precision of our measurements by a big margin. So, finding a way to identify these low beam direction momentum particles consistently is of great importance as it is crucial for the experiment to reduce the background as much as possible.\\ +As explained in section \ref{exp_setup}, the outgoing particles are supposed to recurl back into the outer stations of the detector to enable a precise measurement of the momentum. A problem arises if the particles have almost no momentum in the beam direction. Then they can recurl back into the central station and cause additional hits there. As the central station is designed to let particles easily pass through, they can recurl inside the central station many more times without getting stopped. As we have a $20ns$ time window for the readout of the pixel detectors, we need a very reliable way to identify and reconstruct these tracks of recurling particles, as otherwise they look exactly like newly produced particles coming from our target. As one can imagine, this influences the precision of our measurements by a big margin. 
So, finding a way to identify these particles with low momentum in the beam direction consistently is of great importance, as it is crucial for the experiment to reduce the background as much as possible.\\ There is already existing software to reconstruct particle tracks. However, it struggles to find the right tracks for a lot of the particles recurling back into the center station.\\ These recurlers will typically leave eight hits or more, four (one on each silicon pixel detector layer) when initially leaving the detector and another four when initially falling back in. It is possible for these recurlers to produce even more hits when leaving the detector again but for this thesis we will be only focusing on these 8 hit tracks.\\ The current reconstruction algorithm works by fitting helix paths with a $\chi^2$ method onto the 8 hits.\\ -However, experience has shown that often the fit with the lowest $\chi^2$ isn't necessarily the right track. If we increase the $\chi^2$ limit value to some arbitrary limit, we get a selection of several possible tracks per particle. Without any additional tools however, it is impossible to figure out if the right track is in the selection\footnote{\alignLongunderstack{\text{Based on detector efficiency it is possible for a particle to leave less}\\ \text{than 8 tracks and therefore not be reconstructed by the algorithm}}} and if yes which one of them correct track is. +However, experience has shown that often the fit with the lowest $\chi^2$ isn't necessarily the right track. If we increase the $\chi^2$ limit value to some arbitrary limit, we get a selection of several possible tracks per particle. Without any additional tools however, it is impossible to figure out if the right track is in the selection\footnote{\alignLongunderstack{\text{Based on detector efficiency it is possible for a particle to leave fewer}\\ \text{than 8 hits and therefore not be reconstructed by the algorithm}}} and if so, which one of them is the correct track. \begin{figure}[H] \begin{center} diff --git a/Report/04_machine_learning.tex b/Report/04_machine_learning.tex index 064aa16..6f41d63 100644 --- a/Report/04_machine_learning.tex +++ b/Report/04_machine_learning.tex @@ -11,9 +11,9 @@ \subsubsection{General concepts} -The fundamental concept behind artificial neural networks is to imitate the architecture of the human brain. They can be used for classification problems as well as regression problems. In its most simple form, it can be thought of some sort of mapping from some input to some target. For this thesis two neural networks of a special subtype of neural networks, called recurrent neural networks, were used. All of the networks used in this thesis were written in the python library Keras \cite{chollet2015keras} with a Tensorflow \cite{abadi2016tensorflow} backend. In this section the basic principles of neural networks will be explained.\\ +The fundamental concept behind artificial neural networks is to imitate the architecture of the human brain. They can be used for classification problems, as well as regression problems. In its most simple form, it can be thought of as some sort of mapping from some input to some target. For this thesis, two neural networks of a special subtype of neural networks, called recurrent neural networks, were used. All of the networks used in this thesis were written in the python library Keras \cite{chollet2015keras} with a Tensorflow \cite{abadi2016tensorflow} backend. 
In this section the basic principles of neural networks will be explained.\\ -A neural network consists of many neurons organized in layers as seen in figure \ref{neural_network_arch}. Each neuron is connected to every neuron in the neighbouring layers, while each of these connections has a specific weight assigned to it.\\ +A neural network consists of many neurons organized in layers, as seen in figure \ref{neural_network_arch}. Each neuron is connected to every neuron in the neighbouring layers, while each of these connections has a specific weight assigned to it.\\ In its most basic form, each neuron calculates a weighted sum of all of its inputs and then applies a bias to it. In addition, each neuron has an activation function, which will be applied at the end of the calculation (see also figure \ref{neuron}): \begin{equation} @@ -21,7 +21,7 @@ \end{equation} This is done to create non linearity in the system. Later, some more complex architectures of neurons will be presented.\\ -The first layer, also called input layer, is always defined by the number of inputs, with one dimension for each input. The dimensions of the following layers (excluding the last one), which are also called hidden layers, can be chosen to be an arbitrarily number. The number of dimensions of the last layer, also called output layer, is determined by the dimensionality of the prediction. The number of hidden layers, and their corresponding dimension, changes the performance of the system. +The first layer, also called input layer, is always defined by the number of inputs, with one dimension for each input. The dimensions of the following layers (excluding the last one), which are also called hidden layers, can be chosen to be an arbitrary number. The number of dimensions of the last layer, also called output layer, is determined by the dimensionality of the prediction. The number of hidden layers and their corresponding dimensions change the performance of the system. \begin{figure}[H] \begin{center} @@ -39,7 +39,7 @@ \end{center} \end{figure} -There is no way of knowing how many dimensions and layers will give you the best performance, as one can only define general effects of what happens when they are being modified. Generally, increasing the number of layers enables the system to solve more complex problems, while more dimensions make the system more flexible. However, even these general guidelines are to be applied with caution. For example, adding too many layers can cause the system to train exceedingly slow, whilst adding to many neurons with a too small training set can result in overfitting\footnote{When a system performs well on the training set but poorly on the test set}. Depending on the problem and the data given, each has its own optimal configuration. By gaining more experience with NN, people can take better guesses where to start. However, in the end it always results in some sort of systematic trial and error to find the optimal configuration.\\ +There is no way of knowing how many dimensions and layers will give you the best performance, as one can only define general effects of what happens when they are being modified. Generally, increasing the number of layers enables the system to solve more complex problems, while more dimensions make the system more flexible. However, even these general guidelines are to be applied with caution. For example, adding too many layers can cause the system to train exceedingly slowly, whilst adding too many neurons with a too small training set can result in overfitting\footnote{When a system performs well on the training set but poorly on the test set}. Depending on the problem and the data given, each has its own optimal configuration. By gaining more experience with NN, people develop a better intuition of where to start. However, in the end it always results in some sort of systematic trial and error to find the optimal configuration.\\
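To make the calculation performed by a single neuron concrete, here is a minimal numpy sketch (the weights, bias and inputs are illustrative; the choice of activation function is discussed next):
\begin{verbatim}
import numpy as np

# A single neuron: weighted sum of its inputs plus a bias,
# passed through an activation function.
def neuron(x, w, b, activation=np.tanh):
    return activation(np.dot(w, x) + b)

x = np.array([0.5, -1.2, 3.0])  # inputs
w = np.array([0.1, 0.4, -0.2])  # one weight per connection
b = 0.05                        # bias
print(neuron(x, w, b))
\end{verbatim}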
\subsubsection{Activation functions} @@ -79,12 +79,12 @@ \subsubsection{Concepts of training} The neural network is trained with a sample of events. This sample consists of a few input parameters and a training target, which is the value the neural network will be trained to predict. Three important terms for the training of a neural network are epochs, batch size and loss function.\\ -An epoch refers to one training iteration, where all of the training samples get used once and the weights and biases get modified to fit the wanted targets better. Usually a system is trained over many epochs until the weights and biases stay approximately constant at their optimal values.\\ -Batch size refers to the number of examples that are given to the system at once during the training. Batch size should neither be chosen too small, e.g. small batch sizes train slower, nor too big, some randomness is wanted. Experience shows, that a reasonable batch size usually lies between 10 to 100 examples per batch. It is important to note that by decreasing batch size we make the minimum of the mapping we want to find wider. This makes finding the general area of the minimum easier. However if the minimum gets too wide, the slope gets to small to reach the minimum in a reasonable time. On the other side by increasing the batch size too much, the minimum gets exceedingly narrower and it possible to continuously keep "jumping" over the minimum with every training step performed. +An epoch refers to one training iteration, where all of the training samples get used once and the weights and biases get modified to fit the wanted targets better. Usually, a system is trained over many epochs until the weights and biases stay approximately constant at their optimal values.\\ +Batch size refers to the number of examples that are given to the system at once during the training. Batch size should neither be chosen too small (small batch sizes train slower) nor too big (some randomness is wanted). Experience shows that a reasonable batch size usually lies between 10 and 100 examples per batch. It is important to note that by decreasing batch size we make the minimum of the mapping we want to find wider. This makes finding the general area of the minimum easier. However, if the minimum gets too wide, the slope gets too small to reach the minimum in a reasonable time. On the other hand, by increasing the batch size too much, the minimum gets exceedingly narrower and it is possible for the system to continuously keep "jumping" over the minimum with every training step performed.
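As a sketch of how epochs, batch size and a loss function come together in Keras (purely illustrative dimensions and values, not the configuration of the networks used in this thesis):
\begin{verbatim}
import numpy as np
from keras.models import Sequential
from keras.layers import Dense

# Toy training sample: 1000 examples with 24 input parameters
# and a binary training target.
x = np.random.randn(1000, 24)
y = np.random.randint(0, 2, size=1000)

model = Sequential()
model.add(Dense(32, activation='relu', input_dim=24))
model.add(Dense(1, activation='sigmoid'))
model.compile(optimizer='adam', loss='binary_crossentropy')

# 50 epochs, i.e. 50 passes over the whole training sample,
# presented to the system in batches of 32 examples.
model.fit(x, y, epochs=50, batch_size=32)
\end{verbatim}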
\subsubsection{Loss functions} -To train the system, we need some way to parametrize the quality of our predictions. To account for that we use a loss function. A loss function takes the predicted values of the system and the targeted values to give us an absolute value of our performance. There are various loss functions. In the two RNN's "mean squared error"(MSE, formula \ref{MSE}) and "binary crossentropy"(BC, formula \ref{BC}) were being used. The goal of every NN is to minimize the loss function. +To train the system we need some way to parametrize the quality of our predictions. To account for that, we use a loss function. A loss function takes the predicted values of the system and the targeted values to give us an absolute value of our performance. There are various loss functions. In the two RNN's, "mean squared error" (MSE, formula \ref{MSE}) and "binary crossentropy" (BC, formula \ref{BC}) were used. The goal of every NN is to minimize the loss function. \begin{align} L(w,b) = \frac{1}{n} \sum^n_{i=1} (\hat{Y}_i(w_i,b_i) - Y_i)^2 @@ -105,7 +105,7 @@ \subsubsection{Stochastic gradient descent} -There exist several methods to minimize the loss. The most simple one being stochastic gradient descent(SGD). When performing SGD we can calculate the gradient and just apply it to our weights and biases. By doing this repeatedly, we will eventually end up in a minimum\footnote{It is very possible to also just get stuck in a local minimum}. +Several methods exist to minimize the loss, the simplest one being stochastic gradient descent(SGD). When performing SGD, we can calculate the gradient and just apply it to our weights and biases. By doing this repeatedly, we will eventually end up in a minimum\footnote{It is very possible to also just get stuck in a local minimum}. \subsubsection{Stochastic gradient descent with Momentum} @@ -117,8 +117,7 @@ \subsubsection{Adam} -The most commonly used algorithm however, is the Adam algorithm \cite{chilimbi2014project}, which stands for Adaptive Moment estimation, training algorithm (see formulas \ref{adam_alg}). Is is essentially a combination of Momentum and RMSProp and takes the best of both. It is also the one used to train both RNN's of this thesis as it converges the quickest and most reliable to the global minimum. The algorithm contains two moments. The first moment is an exponentially decaying average of past gradients as in Momentum while the second -moment is an exponentially decaying average of past squared gradients as in RMSProp. +The most commonly used algorithm, however, is the Adam (Adaptive Moment Estimation) training algorithm \cite{chilimbi2014project} (see formulas \ref{adam_alg}). It is essentially a combination of Momentum and RMSProp and takes the best of both. It is also the one used to train both RNN's of this thesis, as it converges the quickest and most reliably to the global minimum. The algorithm contains two moments. The first moment is an exponentially decaying average of past gradients, as in Momentum. On the other hand, the second moment is an exponentially decaying average of past squared gradients, as in RMSProp. \begin{center} @@ -157,7 +156,7 @@ \subsubsection{Batch normalisation} -Another important technique often used in NN is Batch Normalisation \cite{ioffe2015batch}, \cite{cooijmans2016recurrent}. By performing Batch Normalization we normalize and center the input around zero in between every layer of the NN. Batch Normalization has proven to be a potent technique to make NN train faster and even perform better. +Another important technique often used in NN is Batch Normalisation \cite{ioffe2015batch}, \cite{cooijmans2016recurrent}. By performing Batch Normalisation, we normalize and center the input around zero in between every layer of the NN. Batch Normalisation has proven to be a potent technique to make NN train faster and even perform better.
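In Keras, this can be sketched as follows (illustrative layer sizes; a BatchNormalization layer is placed between the fully connected layers):
\begin{verbatim}
from keras.models import Sequential
from keras.layers import Dense, BatchNormalization

model = Sequential()
model.add(Dense(32, activation='relu', input_dim=24))
model.add(BatchNormalization())  # normalize and center between layers
model.add(Dense(32, activation='relu'))
model.add(BatchNormalization())
model.add(Dense(1, activation='sigmoid'))
\end{verbatim}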
\begin{figure}[H] \begin{center} @@ -171,9 +170,9 @@ \subsubsection{General concepts} -Recurrent Neural Networks(RNN) are subclass of neural networks and are specialised to deal with sequential data structures. There are various applications for RNN's such as speech recognition, music generation, sentiment classification, DNA sampling and so on. Generally normal NN don't perform that well on sequential data. One of the reasons is for example that it doesn't share features learned across different positions in the data\footnote{In our experiment positions of the particles with x,y,z in the detector}. Another problem is that the input and output don't necessarily have to have the same length every time.\\ -It is important to note that when using RNN's what the units we called neurons before are usually called cells.\\ -RNN's pose a much better representation of the data which also helps reducing the number of variables in the system and hereby make it train more efficiently. +Recurrent Neural Networks(RNN) are a subclass of neural networks and are specialised to deal with sequential data structures. There are various applications for RNN's, such as speech recognition, music generation, sentiment classification, DNA sampling and so forth. Generally, normal NN don't perform that well on sequential data. One of the reasons is, for example, that they don't share features learned across different positions in the data\footnote{In our experiment positions of the particles with x,y,z in the detector}. Another problem is that the input and output don't necessarily have to have the same length every time.\\ +It is important to note that, when using RNN's, the units we called neurons before are usually called cells.\\ +RNN's pose a much better representation of the data, which also helps reduce the number of variables in the system and hereby makes training more efficient. \begin{figure}[H] \begin{center} @@ -192,25 +191,25 @@ \item $ a^{\langle t \rangle}$: Information passed over from the last step \end{itemize} -In figure \ref{RNN_arch} the general architecture of a RNN can be seen. Every step of the input data ($ x^{\langle t \rangle}$) gets sequentially fed into the RNN which then generates some output $ \hat{y}^{\langle t \rangle}$ after every step of the input. To share already learned information and features for future steps, $ a^{\langle t \rangle}$ gets passed down as additional input into the RNN for the next step. +In figure \ref{RNN_arch} the general architecture of a RNN can be seen. Every step of the input data ($ x^{\langle t \rangle}$) gets sequentially fed into the RNN, which then generates some output $ \hat{y}^{\langle t \rangle}$ after every step of the input. To share already learned information and features for future steps, $ a^{\langle t \rangle}$ gets passed down as additional input into the RNN for the next step.
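A minimal numpy sketch of this recurrence for a basic RNN cell (illustrative dimensions and random weights; the LSTM cells discussed below refine this internal structure):
\begin{verbatim}
import numpy as np

# One step of a basic RNN cell: the new state a<t> is computed from
# the previous state a<t-1> and the current input x<t>, and an
# output y<t> is generated at every step.
def rnn_step(a_prev, x_t, Wa, Wx, Wy, ba, by):
    a_t = np.tanh(Wa @ a_prev + Wx @ x_t + ba)
    y_t = Wy @ a_t + by
    return a_t, y_t

rng = np.random.RandomState(0)
Wa, Wx, Wy = rng.randn(4, 4), rng.randn(4, 3), rng.randn(1, 4)
ba, by = np.zeros(4), np.zeros(1)

a = np.zeros(4)              # no information before the first step
for x_t in rng.randn(8, 3):  # e.g. a track of 8 hits with (x, y, z)
    a, y = rnn_step(a, x_t, Wa, Wx, Wy, ba, by)
\end{verbatim}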
 \subsubsection{Most common architectures}
 
 There are two concepts of how the data is fed into the system and three structures of RNN's, depending on the input and output of the system.\\
 Usually, the data is fed into the system step by step. For problems where the entire sequence is not already known at the start, this is the only way to feed the data into the system.\\
-If however, the entire sequence is already known at the beginning, e.g. in sequence classification, often the information is read by the system forwards and backwards. Networks with this specific architecture are called bidirectional RNN's \cite{schuster1997bidirectional}. This often increases the systems performance.\\
+If, however, the entire sequence is already known at the beginning, e.g. in sequence classification, the information is commonly read by the system forwards and backwards. Networks with this specific architecture are called bidirectional RNN's \cite{schuster1997bidirectional}. This often increases the system's performance.\\
 However, as the first RNN was meant to predict particle tracks after they leave the detector, we could only use a unidirectional RNN, since the whole track wasn't available at that point. The second RNN is a classifier of the tracks. With the whole information available from the start, it was designed to be a bidirectional RNN.\\
 A system has a "many-to-one" architecture if we have a sequential input but only care about the final output of the system, e.g. classification problems. This is the architecture used for both RNN's. By the same reasoning, if we have sequential inputs and care about the output generated at each step, e.g. speech recognition, the architecture is called "many-to-many". A "one-to-one" architecture is basically just a regular NN.
 
 \subsubsection{Cell types}
 
-Besides the basic RNN cell type, which shall not be discussed in detail in this thesis, the two most influential and successful cell types are Long-Short-Term-Memory(LSTM) \cite{gers1999learning} cells and Gated Recurrent Units(GRU) \cite{chung2014empirical}. However, in this thesis only LSTM cells will be explained in greater detail as the were the only cells used in the RNN's.\\
+Besides the basic RNN cell type, which shall not be discussed in detail in this thesis, the two most influential and successful cell types are Long Short-Term Memory (LSTM) cells \cite{gers1999learning} and Gated Recurrent Units (GRU) \cite{chung2014empirical}. However, in this thesis only LSTM cells will be explained in greater detail, as they were the only cells used in the RNN's.\\
 
-GRU's were invented with the intention to create a cell type with a similar performance to the LSTM cell while having a simpler internal structure. By being less complex as an LSTM cell a GRU cell has also less parameters to modify during training which also speeds up training.\\
+GRU's were invented with the intention of creating a cell type with a performance similar to the LSTM cell, while having a simpler internal structure. Being less complex than an LSTM cell, a GRU cell also has fewer parameters to modify during training, which speeds up training.\\
 
-LSTM cells (see figure \ref{LSTM_arch}) have many useful properties such as a forget gate, an update gate as well as an output gate. With this cell type, it is easy to pass down information for the following steps without it being altered in a big way (Long term memory). However, there are also ways built in to update this passed down information with new one (Short term memory). Even though GRU's are gaining more and more traction, LSTM-cells are still widely considered to be the most successful type of cells.
+LSTM cells (see figure \ref{LSTM_arch}) have many useful properties, such as a forget gate, an update gate, as well as an output gate. With this cell type, it is easy to pass information down to the following steps without it being altered in a big way (long term memory). However, there are also built-in ways to update this passed-down information with new information (short term memory). Even though GRU's are gaining more and more attention, LSTM cells are still widely considered to be the most successful type of cells.
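To make the three gates concrete, here is a minimal NumPy sketch of one LSTM step in our own notation; the weight containers W and b are illustrative, not the thesis' implementation:
\begin{verbatim}
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, a_prev, c_prev, W, b):
    """One LSTM step; W and b hold the weights/biases of the four internal layers."""
    z = np.concatenate([a_prev, x_t])
    f = sigmoid(W["f"] @ z + b["f"])        # forget gate: what to drop from c
    u = sigmoid(W["u"] @ z + b["u"])        # update gate: what new info to let in
    c_tilde = np.tanh(W["c"] @ z + b["c"])  # candidate values for the cell state
    c_t = f * c_prev + u * c_tilde          # long term memory (cell state)
    o = sigmoid(W["o"] @ z + b["o"])        # output gate
    a_t = o * np.tanh(c_t)                  # short term memory / hidden state
    return a_t, c_t
\end{verbatim}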
 \begin{figure}[H]
 \begin{center}
@@ -236,4 +235,4 @@
 
 XGBoost \cite{ML:XGBoost} is based on boosted decision trees (extreme gradient boosting). In this approach, the data samples get split using a decision tree. With every step, a new tree gets created to account for the errors of the prior models; the trees are then added together to create the final prediction. A gradient descent algorithm is used to minimize the loss when adding new trees. \\
-It's is often used as a classifier. However, it can also used in regression models. In this thesis, an XGBoost classifier was used to determine a baseline and have some comparison for our bidirectional RNN classifier.
\ No newline at end of file
+It is often used as a classifier. However, it can also be used in regression models. In this thesis, an XGBoost classifier was used to determine a baseline and to have some comparison for our bidirectional RNN classifier.
\ No newline at end of file
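A hedged sketch of how such a baseline can be set up with the xgboost Python package; the variable names and hyperparameters here are illustrative assumptions, not the thesis' actual configuration:
\begin{verbatim}
import xgboost as xgb
from sklearn.metrics import accuracy_score, roc_auc_score

# X_train, y_train, X_test, y_test: flattened track features and
# truth labels, assumed to be prepared as in the Data section
model = xgb.XGBClassifier(n_estimators=100, max_depth=4, learning_rate=0.1)
model.fit(X_train, y_train)

proba = model.predict_proba(X_test)[:, 1]  # confidence that a track is true
print("accuracy:", accuracy_score(y_test, proba > 0.5))
print("ROC AUC:", roc_auc_score(y_test, proba))
\end{verbatim}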
diff --git a/Report/05_Data.tex b/Report/05_Data.tex
index e18b12d..e6aec64 100644
--- a/Report/05_Data.tex
+++ b/Report/05_Data.tex
@@ -1,21 +1,21 @@
 \section{Data}
 
 \subsection{General information}
 
-There were two sets of data used in this thesis. First, each of the datasets were shuffled to counteract any bias given by the sequence of the data and then split into two parts. $80\%$ was used to train the model(training set) while the remaining $20\%$ were later used to test the model(test set).\\
+There were two sets of data used in this thesis. First, each of the datasets was shuffled to counteract any bias given by the ordering of the data and then split into two parts: $80\%$ was used to train the model (training set), while the remaining $20\%$ was later used to test the model (test set).\\
 The sets were created using a Geant4 \cite{agostinelli2003s} based simulation with the specific configuration of the $\mu \rightarrow 3e$-experiment.\\
 The first dataset(dataset 1) contained 46896 true 8-hit tracks of recurling particles, with each hit consisting of 3 coordinates (x,y,z).\\
-The second dataset(dataset 2) contained 109821 tracks. These were exclusively tracks that the current track reconstruction algorithm wasn't conclusively able to assign to an event. As a result, every event contained all the preselected tracks, computed by the already existing algorithm, that were calculated to be a possible track. It is important to note that only for around $75\%$ of the events, the true track was in this preselection. This posed an additional challenge, as one could not just simply chose the best fitting track. To assign the tracks to their corresponding events, they all carried an event number with them matching them with their event.\footnote{One number for all tracks of the same events}. Each track contained the coordinates of the 8 hits (x,y,z), the value of the $\chi^2$-fit performed by the reconstruction algorithm, the event number as well as a label which told us if the track was true or false\footnote{Only used for training and testing of the system}.
+The second dataset(dataset 2) contained 109821 tracks. These were exclusively tracks that the current track reconstruction algorithm wasn't conclusively able to assign to an event. As a result, every event contained all the preselected tracks, computed by the already existing algorithm, that were calculated to be possible tracks. It is important to note that only for around $75\%$ of the events the true track was in this preselection at all. This posed an additional challenge, as one could not just simply choose the best fitting track. To assign the tracks to their corresponding events, they all carried an event number matching them with their event\footnote{One number for all tracks of the same event}. Each track contained the coordinates of the 8 hits (x,y,z), the value of the $\chi^2$-fit performed by the reconstruction algorithm, the event number, as well as a label which told us if the track was true or false\footnote{Only used for training and testing of the system}.
 
 \subsection{Preprocessing}
 
 \subsubsection{Dataset 1}
 
-To optimize the data fed into the RNN, dataset 1 was preprocessed. In a first step, a min-max scaler with a range of $[-0.9,0.9]$ from the python library Scikit-learn \cite{pedregosa2011scikit} was used. This particular choice of range was based on the fact that a $tanh$ activation function was used in the output layer. To accommodate for its properties of being asymptotically bounded by $\pm 1$ we chose a range of $[-0.9,0.9]$ to make all the data easily reachable by the system. In a second step, the data got shuffled and split into the training and test sets. The first four steps were used as an input for the RNN while the second four steps were our prediction target.
+To optimize the data fed into the RNN, dataset 1 was preprocessed. In a first step, a min-max scaler with a range of $[-0.9,0.9]$ from the python library Scikit-learn \cite{pedregosa2011scikit} was used. This particular choice of range was based on the fact that a $tanh$ activation function was used in the output layer. To accommodate its property of being asymptotically bounded by $\pm 1$, we chose a range of $[-0.9,0.9]$ to make all the data easily reachable by the system. In a second step, the data was shuffled and split into the training and test sets. The first four steps were used as the input for the RNN, while the last four steps were our prediction target.
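As a minimal sketch of this preprocessing, assuming the tracks are held in a NumPy array (the variable names are ours):
\begin{verbatim}
import numpy as np
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split

# tracks: array of shape (n_tracks, 8, 3) holding the x,y,z hits
scaler = MinMaxScaler(feature_range=(-0.9, 0.9))  # matches the tanh output layer
scaled = scaler.fit_transform(tracks.reshape(-1, 3)).reshape(tracks.shape)

X = scaled[:, :4, :]                           # first four hits: RNN input
y = scaled[:, 4:, :].reshape(len(scaled), 12)  # last four hits: target (x5 ... z8)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, shuffle=True)
\end{verbatim}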
 \subsubsection{Dataset 2}
 \label{dataset2}
 
-Analogously to dataset 1, first the coordinates of the tracks as well as the $\chi^2$ were scaled with a min max scaler (separate ones) with a range of $[-0.9,0.9]$ from the python library Scikit-learn. Then the first four steps of every track were taken and fed into our first track predicting RNN. For each of the last four steps of a track we then had two sets of coordinates. One were the predicted coordinates of our RNN and the other one the coordinates given by the reconstructing algorithm. To have the information of the $\chi^2$ fit available at each step, we created an array of shape $(\#tracks, steps, 4)$ (1 dimension for each of the coordinates and another for the $\chi^2$ fit). However, at the spot of the x,y,z coordinates there were neither the predicted coordinates of our RNN nor the coordinates given by the reconstructing algorithm but instead the difference of the two. Our target was the truth value of each track\footnote{$1 =$ true, $0 =$ false}.
+Analogously to dataset 1, the coordinates of the tracks, as well as the $\chi^2$, were first scaled with min-max scalers (separate ones) with a range of $[-0.9,0.9]$ from the python library Scikit-learn. Then, the first four steps of every track were taken and fed into our first, track predicting RNN. For each of the last four steps of a track, we then had two sets of coordinates: the coordinates predicted by our RNN and the coordinates given by the reconstruction algorithm. To have the information of the $\chi^2$ fit available at each step, we created an array of shape $(\#tracks, steps, 4)$ (1 dimension for each of the coordinates and another for the $\chi^2$ fit). However, in place of the x,y,z coordinates, we stored neither the coordinates predicted by our RNN nor the coordinates given by the reconstruction algorithm, but the difference of the two. Our target was the truth value of each track\footnote{$1 =$ true, $0 =$ false}.
diff --git a/Report/06_RNN_used.tex b/Report/06_RNN_used.tex
index bf6b06a..eff69c0 100644
--- a/Report/06_RNN_used.tex
+++ b/Report/06_RNN_used.tex
@@ -2,7 +2,7 @@
 
 \subsection{RNN for track prediction}
 
-The first RNN had the task to predict the positions of the recurled 4 hits. As input the 4 hits of an outgoing particle are used.
+The first RNN had the task of predicting the positions of the 4 hits of the recurling track. As input, the 4 hits of the outgoing particle are used.
 
 \begin{figure}[h]
 \begin{center}
@@ -22,8 +22,8 @@
 \item[4. Layer:] Dense layer (12 cells)
 \end{itemize}
 
-The optimal number of layers, cells and cell-type was found by systematically comparing RNN's that are equal besides one property (e.g. Using GRU's instead of LSTM cells). Also all the activation functions were chosen to be selu's.\\
-The loss and metric function used were the mean squared error(mse) as this had the most similarity with an euclidian distance. The model itself was trained by an Adam algorithm.\\
+The optimal number of layers, cells and the cell type were found by systematically comparing RNN's that are equal besides one property (e.g. using GRU's instead of LSTM cells). Also, all the activation functions were chosen to be selu's.\\
+The loss and metric function used was the mean squared error (MSE), as this is the closest to a Euclidean distance. The model itself was trained by the Adam algorithm.\\
 The output was a 12 dimensional vector of the shape: $(x_5, y_5, z_5, x_6, y_6, z_6, ..., z_8)$. Note that the numeration starts with 5, as the 5$^\text{th}$ hit of the track is the first one to be predicted.
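A minimal Keras sketch consistent with this description; since the hunk above only shows the 12-cell output layer, the sizes and number of the hidden layers here are placeholder assumptions:
\begin{verbatim}
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense

model = Sequential([
    # hidden layer sizes are assumed; only the 12-cell output is fixed above
    LSTM(50, activation='selu', return_sequences=True, input_shape=(4, 3)),
    LSTM(50, activation='selu'),
    Dense(50, activation='selu'),
    Dense(12, activation='tanh'),  # (x5, y5, z5, ..., z8), scaled to [-0.9, 0.9]
])
model.compile(optimizer='adam', loss='mse', metrics=['mse'])
model.fit(X_train, y_train, epochs=50, batch_size=256, validation_split=0.1)
\end{verbatim}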
 \subsection{RNN for classification of tracks}
 
@@ -38,7 +38,7 @@
 \item Value of the $\chi^2$ fit
 \end{itemize}
 
-The output was then just a one dimensional vector, where $1$ stands for a true track and $0$ stands for a false track. The RNN itself is going to predict a number between $0$ and $1$, which can be interpreted as amount of confidence that it is a true track.
+The output was then just a one dimensional vector, where $1$ stands for a true track and $0$ stands for a false track. The RNN itself predicts a number between $0$ and $1$, which can be interpreted as its confidence that the track is a true track.
 
 \begin{figure}[H]
 \begin{center}
@@ -48,7 +48,7 @@
 \end{center}
 \end{figure}
 
-The RNN for the classification was chosen to be bidirectional and as in the RNN before LSTM cells were used. Here, a tanh was used for all the activation functions besides the last one. The last layer used a softmax activation function\footnote{Similar to a tanh but bounded between [0,1]}. As tanh doesn't automatically do batch normalization, between every layer of cells a batch normalization layer was added.\\
+The RNN for the classification was chosen to be bidirectional and, as in the RNN before, LSTM cells were used. Here, a tanh was used for all the activation functions besides the last one. The last layer uses a softmax activation function\footnote{Similar to a tanh but bounded between [0,1]}. As tanh doesn't automatically do batch normalization, a batch normalization layer was added between every layer of cells.\\
 The layout of the layers was as follows:
 
 \begin{itemize}
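The hunk cuts the layer list off here. As a hedged sketch, a bidirectional classifier of the kind described above could look as follows in Keras; the layer sizes are assumptions, and the single-cell output bounded between $[0,1]$ is written as a sigmoid, which is the function the footnote describes:
\begin{verbatim}
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Bidirectional, LSTM, BatchNormalization, Dense

model = Sequential([
    # input: 4 steps, each (dx, dy, dz, chi2); hidden sizes are assumed
    Bidirectional(LSTM(50, activation='tanh', return_sequences=True),
                  input_shape=(4, 4)),
    BatchNormalization(),
    Bidirectional(LSTM(50, activation='tanh')),
    BatchNormalization(),
    Dense(1, activation='sigmoid'),  # confidence in [0,1] that the track is true
])
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
\end{verbatim}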
diff --git a/Report/07_Analysis.aux b/Report/07_Analysis.aux
index 429c1b2..d099b46 100644
--- a/Report/07_Analysis.aux
+++ b/Report/07_Analysis.aux
@@ -1,6 +1,6 @@
 \relax
 \providecommand\hyper@newdestlabel[2]{}
-\@writefile{toc}{\contentsline {section}{\numberline {8}Results}{38}{section.8}}
+\@writefile{toc}{\contentsline {section}{\numberline {8}Analysis}{38}{section.8}}
 \@writefile{toc}{\contentsline {subsection}{\numberline {8.1}Best $\chi ^2$}{38}{subsection.8.1}}
 \@writefile{toc}{\contentsline {subsection}{\numberline {8.2}RNN classifier with RNN track prediction input}{38}{subsection.8.2}}
 \newlabel{RNN_tp_fp_hist}{{14a}{39}{Number of false positives and false negatives depending cut\relax }{figure.caption.18}{}}
diff --git a/Report/07_Analysis.tex b/Report/07_Analysis.tex
index f89c575..16650a0 100644
--- a/Report/07_Analysis.tex
+++ b/Report/07_Analysis.tex
@@ -1,10 +1,10 @@
-\section{Results}
+\section{Analysis}
 
 \subsection{Best $\chi^2$}
 
-The most simple version to try to classify which one is the right path out of the preselection would be to just take the path with the smallest $\chi^2$. Like this, we would choose the path that agrees the most with the track reconstructing algorithm that gives us our preselection. However, as already mentioned, in dataset 2 only around $75\%$ of the events even have the true track among the ones preselected by the reconstruction\footnote{E.g. by not having all 8 hits as a result of detector efficiency (searches for 8 hits)}. In this case we would have to label all the tracks as false tracks. By simply choosing the best $\chi^2$ we don't account for this at all. So, by default our maximum accuracy would be around $75\%$ if the true track would really always just be the one with the best $\chi^2$.\\
+The simplest way to try to classify which path out of the preselection is the right one would be to just take the path with the smallest $\chi^2$. Like this, we would choose the path that agrees the most with the track reconstruction algorithm that gives us our preselection. However, as already mentioned, in dataset 2 only around $75\%$ of the events even have the true track among the ones preselected by the reconstruction\footnote{E.g. by not having all 8 hits as a result of detector efficiency (searches for 8 hits)}. In these cases, we would have to label all the tracks as false tracks. By simply choosing the best $\chi^2$, we don't account for this at all. So, even if the true track really always were the one with the best $\chi^2$, our maximum accuracy would only be around $75\%$.\\
 
-It turns out the accuracy of this method is only at $52.01\%$. So, there is a need for better algorithms to classify this problem.
+It turns out that the accuracy of this method is only $52.01\%$. Therefore, there is a need for better algorithms for this classification problem.
 
 \subsection{RNN classifier with RNN track prediction input}
 
@@ -50,9 +50,9 @@
 \end{center}
 \end{figure}
 
-In figure \ref{XGB_tp_fp_hist} the blue bins are false positives and the orange bins are false negatives. Here we see that the bins are more evenly spread and gather less at the edges. So, already qualitatively we can guess that it will perform worse than our RNN's.\\
+In figure \ref{XGB_tp_fp_hist}, the blue bins are false positives and the orange bins are false negatives. Here, we see that the bins are more evenly spread and gather less around the edges. So, already qualitatively, we can guess that it will perform worse than our RNN's.\\
 
-Figure \ref{XGB_ROC} shows the ROC curve of the XGB classifier. Generally, the more area under the ROC curve the better the classifier. In the perfect case, where everything gets labelled $100\%$ correctly, the area under the curve would be 1. Here we have an area of $0.88$.\\
+Figure \ref{XGB_ROC} shows the ROC curve of the XGB classifier. Here, we have a ROC AUC (area under the ROC curve) of $0.88$; a perfect classifier would reach 1.\\
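For reference, the ROC AUC and the accuracy values quoted in the comparison below can be computed as follows (a sketch; the model and data variables are assumed from the earlier snippets):
\begin{verbatim}
from sklearn.metrics import roc_curve, roc_auc_score, accuracy_score

proba = model.predict(X_test).ravel()            # classifier confidence per track
fpr, tpr, thresholds = roc_curve(y_test, proba)  # points of the ROC curve
print("ROC AUC:", roc_auc_score(y_test, proba))
print("accuracy at cut 0.5:", accuracy_score(y_test, proba > 0.5))
\end{verbatim}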
 \subsection{Comparison in performance of the RNN and XGBoost}
 
@@ -80,10 +80,10 @@
 RNN & $87.63\%$ & 0.93
 \end{tabular}\\
 
-Using this system of RNN's proves to be viable solution to this problem and brings a huge jump in accuracy also over other machine learning solutions.
+Using this system of RNN's proves to be a viable solution to this problem and brings a huge gain in accuracy, while also outperforming other machine learning solutions.
 
 \subsection{Outlook and potential}
 
-Where do we want to go from here? One way to improve the algorithm would for example be to create a fully connected neural network \cite{gent1992special}. By doing this both RNN's would be connected and would train as a unit. This would have the positive effect of not having to retrain the classifying RNN as well whenever the first on gets changed. \\
-Another goal could be to make this type of RNN appliable to more types of problems. So for example, instead of being restricted to tracks of a specific length (here eight hits) one could make it more general to be able to deal with an arbitrary length of the track. This would be especially useful for this experiment, as a lot of particles don't just recurl once but many times over (in the central station). Hereby, they are creating a lot of background, which minimalizing is crucial to reach our desired sensitivity of $10^{-16}$.\\
+Where do we want to go from here? One way to improve the algorithm would, for example, be to create a fully connected neural network \cite{gent1992special}. By doing this, both RNN's would be connected and would train as a unit. This would have the positive effect of not having to retrain the classifying RNN whenever the first one gets modified.\\
+Another goal could be to make this type of RNN applicable to more types of problems. For example, instead of being restricted to tracks of a specific length (here eight hits), one could make it general enough to deal with tracks of arbitrary length. This would be especially useful for this experiment, as a lot of particles don't just recurl once but many times (in the central station). Hereby, they create a lot of background, and minimizing it is crucial to reach our desired sensitivity of $10^{-16}$.\\
 The ultimate goal, however, would be to replace the current track reconstruction algorithm altogether and put an RNN in its place. This could, for example, be done by an RNN performing beam search\footnote{Both inside out and outside in} \cite{graves2013speech} to find the true track of a particle. In other areas, beam search has proven to be a powerful tool, and there is a lot of potential for this sort of algorithm in physics as well, especially in track reconstruction.
\ No newline at end of file
diff --git a/Report/Presentation/Bachelor_thesis_defense_presentation.pptx b/Report/Presentation/Bachelor_thesis_defense_presentation.pptx
new file mode 100644
index 0000000..50a7321
--- /dev/null
+++ b/Report/Presentation/Bachelor_thesis_defense_presentation.pptx
Binary files differ