\documentclass[]{beamer}
\setbeamertemplate{navigation symbols}{}
\usepackage{beamerthemesplit}
\useoutertheme{infolines}
\usecolortheme{dolphin}
%\usetheme{Warsaw}
\usetheme{progressbar} 
\usecolortheme{progressbar}
\usefonttheme{progressbar}
\useoutertheme{progressbar}
\useinnertheme{progressbar}
\usepackage{graphicx}
%\usepackage{amssymb,amsmath}
\usepackage[latin1]{inputenc}
\usepackage{amsmath}
\newcommand\abs[1]{\left|#1\right|}
\usepackage{iwona}
\usepackage{hepparticles}
\usepackage{hepnicenames}
\usepackage{hepunits}
\progressbaroptions{imagename=images/mlerning.png}
%\usetheme{Boadilla}

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%5
\definecolor{mygreen}{cmyk}{0.82,0.11,1,0.25}
\setbeamertemplate{blocks}[rounded][shadow=false]
\addtobeamertemplate{block begin}{\pgfsetfillopacity{0.8}}{\pgfsetfillopacity{1}}
\setbeamercolor{structure}{fg=mygreen}
\setbeamercolor*{block title example}{fg=mygreen!50,
bg= blue!10}
\setbeamercolor*{block body example}{fg= blue,
bg= blue!5}





%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%\beamersetuncovermixins{\opaqueness<1>{25}}{\opaqueness<2->{15}}
\title{Machine learning - from theory to practice}  
\author{Marcin Chrzaszcz$^{1,2}$}
\date{\today} 

\begin{document}

{
\institute{$^1$ University of Zurich, $^2$ Institute of Nuclear Physics in Krakow}
\setbeamertemplate{footline}{} 
\begin{frame}
\logo{
\vspace{2 mm}
\includegraphics[height=1cm,keepaspectratio]{images/uzh.jpg}{~}{~}
\includegraphics[height=1cm,keepaspectratio]{images/ifj.png}
}

  \titlepage
\end{frame}
}

\institute{UZH,IFJ} 


\section[Outline]{}
\begin{frame}
\tableofcontents
\end{frame}

%normal slides

	
\section{Introduction}

\begin{frame}\frametitle{Let's start with a joke}
\only<1>{
Q: What is the difference between a physicist and a big pizza?\\
{~}

}
\only<2>{
Q: What is the difference between a physicist and a big pizza?\\
A: A pizza is enough to feed a whole family.

}


\end{frame}





%\subsection{Folding technique}

\begin{frame}\frametitle{What is machine learning?}
{~}
\begin{center}
\only<1>
{
\begin{enumerate}
\item Machine learning:
\begin{itemize}
\item It is the science of how to construct systems that can learn from data. 
\end{itemize}
\end{enumerate}
{~}\\{~}\\{~}\\{~}\\{~}\\{~}
}
\only<2>
{
\begin{enumerate}
\item Machine learning:
\begin{itemize}
\item It is the science of how to construct systems that can learn from data. 
\end{itemize}
\item To be less precise but more intuitive: it helps you solve problems like:
\begin{itemize}
\item  Predict the price of a stock in 6 months from now, on the basis of company performance measures and economic data.
\item Identify the numbers in a handwritten ZIP code, from a digitized image.
\item etc.
\end{itemize}

\end{enumerate}

}

\only<3>
{
A simple example:\\

\includegraphics[scale=.25]{images/example1.png}

}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%


\end{center}
\end{frame}
\section{Linear Models for regression}
\begin{frame}\frametitle{Linear Models}
{~}
\begin{itemize}
\item Let's assume we have a vector of inputs: $X^T =(X_1,X_2,...,X_p)$.
\item We predict the output of our machine/classifier:
\begin{equation}
\widehat{Y} = \widehat{\beta}_0 + \sum\limits_{j=1}^p \widehat{\beta}_j X_j = \sum\limits_{j=0}^p \widehat{\beta}_j X_j = X^T \widehat{\beta},
\end{equation}
where we use the convention $X_0 = 1$.
\item To fit this, one can use the method of least squares:
\begin{equation}
RSS(\beta)= \sum\limits_{i=1}^n (y_i-x_i^T\beta)^2
\end{equation}
\item It's a quadratic function of $\beta$, so a minimum always exists.
\end{itemize}
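Setting the gradient of $RSS(\beta)$ to zero gives the standard closed-form solution (writing $X$ for the input matrix with a leading column of ones and $y$ for the output vector, and assuming $X^TX$ is invertible):
\begin{equation}
\widehat{\beta} = (X^TX)^{-1}X^Ty
\end{equation}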


\end{frame}
\begin{frame}\frametitle{Linear Models - Example}
{~}
Probably I have already managed to bore you, so let's look at an example:
\begin{itemize}
\item We have simulated data in two classes, with a pair of inputs $X_1$, $X_2$.
\item A linear regression was fit to these data.
\item The response $\widehat{Y}$ is color coded:
\begin{equation}
\widehat{G}(\widehat{Y}) =
  \begin{cases}
   \textcolor{orange}{\text{orange}} & \text{if } \widehat{Y} \geq 0.5 \\
   \textcolor{blue}{\text{blue}}      & \text{if } \widehat{Y}  < 0.5
  \end{cases}
\end{equation}
\end{itemize}

\end{frame}
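\begin{frame}[fragile]\frametitle{Linear Models - code sketch}
A minimal NumPy sketch of the least-squares fit and the $0.5$ threshold rule (an illustration only, not the code behind the plots shown later):
\begin{verbatim}
import numpy as np

def fit_linear(X, y):
    # append a column of ones for the intercept beta_0
    Xa = np.hstack([np.ones((len(X), 1)), X])
    # least-squares minimisation of RSS(beta)
    beta, *_ = np.linalg.lstsq(Xa, y, rcond=None)
    return beta

def classify(X, beta):
    Xa = np.hstack([np.ones((len(X), 1)), X])
    yhat = Xa @ beta
    # "orange" if Yhat >= 0.5, "blue" otherwise
    return np.where(yhat >= 0.5, "orange", "blue")
\end{verbatim}
\end{frame}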
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\begin{frame}\frametitle{Linear Models - Example}
\only<1>{
\begin{columns}
\column{3in}

\begin{center}
\includegraphics[scale=.25]{images/linear.png}

\end{center}


\column{2in}

\begin{itemize}
\item We see that in $\mathcal{R}^2$ space we used the decision boundary $\{x : x^T\widehat{\beta}=0.5\}$.
\item A number of points are misclassified on both sides of the boundary. 
\item It looks like our linear model is not appropriate.

\end{itemize}
\end{columns}
}

\only<2>
{
We didn't say anything about how the two samples were generated. The usual scenarios:
\begin{enumerate}
\item The training data in each class were generated from bivariate Gaussian distributions with uncorrelated components and different means.
\item The training data in each class came from a mixture of 10 low-variance Gaussian distributions, with the individual means themselves distributed as Gaussians.

\end{enumerate}
I use Gaussians because they are easy to generate and have a nice interpretation (a short generation sketch follows on the next slide).


}
\end{frame}
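\begin{frame}[fragile]\frametitle{Simulated data - code sketch}
A sketch of how scenario-1 training data could be generated (the means and covariance below are illustrative assumptions, not the values used for the plots):
\begin{verbatim}
import numpy as np
rng = np.random.default_rng(0)

n = 100
# class "blue": bivariate Gaussian, uncorrelated components
X_blue = rng.multivariate_normal([0, 0], np.eye(2), n)
# class "orange": same covariance, different mean
X_orange = rng.multivariate_normal([1, 1], np.eye(2), n)

X = np.vstack([X_blue, X_orange])
y = np.r_[np.zeros(n), np.ones(n)]   # 0 = blue, 1 = orange
\end{verbatim}
\end{frame}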

\begin{frame}\frametitle{Nearest-Neighbor Method}

\begin{itemize}
\item Nearest-neighbor methods use the $k$ observations in the training set closest in input space to $x$ to construct $\widehat{Y}$:
\begin{equation}
\widehat{Y}(x)=\dfrac{1}{k} \sum\limits_{x_i \in N_k(x)} y_i ,
\end{equation}
where $N_k(x)$ is the neighborhood of $x$ defined by the $k$ closest points $x_i$ in the training sample.
\item Let's assume a Euclidean metric and repeat the same example with this new prediction rule.
\item For example, we can take $k=15$ and $k=1$ (a short code sketch follows on the next slide).
\end{itemize}

\end{frame}
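\begin{frame}[fragile]\frametitle{Nearest-Neighbor - code sketch}
A minimal sketch of the $k$-NN average with a Euclidean metric, matching the formula on the previous slide (an illustration only):
\begin{verbatim}
import numpy as np

def knn_predict(x, X_train, y_train, k=15):
    # Euclidean distances from x to every training point
    d = np.linalg.norm(X_train - x, axis=1)
    # indices of the k closest points: the neighborhood N_k(x)
    idx = np.argsort(d)[:k]
    # Yhat(x) = average of the neighbors' responses
    return y_train[idx].mean()
\end{verbatim}
\end{frame}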

\begin{frame}\frametitle{Nearest-Neighbor Example}
\only<1>{
\begin{center}
\begin{columns}
\column{2.5in}
\begin{center}
$k=15$\\{~}\\
\includegraphics[scale=.18]{images/n15.png}
\end{center}
\column{2.5in}
\begin{center}
\begin{itemize}
\item Fewer training observations are misclassified. 
\item This should not give you too much hope!
\item See next example.

\end{itemize}
\end{center}
\end{columns}
\end{center}
}
\only<2>{
\begin{center}
\begin{columns}
\column{2.5in}
\begin{center}
$k=1$\\{~}\\
\includegraphics[scale=.18]{images/n1.png}
\end{center}
\column{2.5in}
\begin{center}
\begin{itemize}
\item No points are misclassified!
\item Clearly this doesn't tell you anything about the real distribution.
\item You should always check your methods on a testing sample.
\item i.e. train on half of the data, apply the classifier to the other half, and check that the results agree.
\end{itemize}
\end{center}
\end{columns}
\end{center}
}


\end{frame}

\begin{frame}\frametitle{Bias-Variance Tradeoff}
{~}
\only<1>{
\begin{itemize}
\item All methods I have described so far have a parameter that needs to be tuned.
\item There are two competing forces. 
\item The trick is to balance the effect. 
\item Let's make an example based on Nearest-Neighbor.
\item The test error ($Y=f(X)+\epsilon$):
\end{itemize}
\begin{equation}
EPE_k(x_0) = \sigma^2+[f(x_0)-\dfrac{1}{k} \sum\limits_{l=1}^k f(x_l)]^2+\dfrac{\sigma^2}{k},
\end{equation}
where $\sigma^2=\mathrm{Var}(\epsilon)$.
}

\only<2>{
\begin{itemize}
\item All methods I have described so far have a parameter that needs to be tuned.
\item There are two competing forces. 
\item The trick is to balance the effect. 
\item Let's make an example based on Nearest-Neighbor.
\item The test error ($Y=f(X)+\epsilon$):
\end{itemize}
\begin{equation}
EPE_k(x_0) = \sigma^2+ \underbrace{[f(x_0)-\dfrac{1}{k} \sum\limits_{l=1}^k f(x_l)]^2}_\text{bias}+\underbrace{\dfrac{\sigma^2}{k}}_\text{var}
\end{equation}

}

\only<3>{
\begin{equation}
EPE_k(x_0) = \sigma^2+ \underbrace{[f(x_0)-\dfrac{1}{k} \sum\limits_{l=1}^k f(x_l)]^2}_\text{bias}+\underbrace{\dfrac{\sigma^2}{k}}_\text{var}
\end{equation}
\begin{itemize}
\item The bias component tends to blow up as $k$ increases.
\item On the other hand, the variance term decreases as $k$ increases.
\item We are basically balancing between the two effects.
\end{itemize}
}
\only<4>
{
\begin{columns}
\column{3in}
\includegraphics[scale=.18]{images/tradeoff.png}


\column{2in}

$ EPE_k(x_0) = \sigma^2+$\\
$\underbrace{[f(x_0)-\dfrac{1}{k} \sum\limits_{l=1}^k f(x_l)]^2}_\text{bias}+\underbrace{\dfrac{\sigma^2}{k}}_\text{var}$

\end{columns}
}

\end{frame}


%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Complex models}
\begin{frame}\frametitle{Other complex models}
Other complex models include:
\begin{itemize}
\item Neural Networks
\item Kernel Methods
\item Sparse Kernel Methods
\item Decision Trees
\item Graphical models
\item Sampling methods
\item Mixture models
\item Principal Component Analysis
\item Many others.
\end{itemize}


\end{frame}


\begin{frame}\frametitle{Neural Networks}
\only<1>{
It's a simple extension of the linear case:
\begin{equation}
Y(\textbf{x},\textbf{w})=f\left(\sum\limits_{j=1}^M w_j\Phi_j(\textbf{x})\right),
\end{equation}
where:\\
$\Phi_j(\textbf{x})$ are basis functions,\\
$f(\cdot)$ is a nonlinear activation function.
}
\only<2>
{
\raggedright In practice: 
\begin{columns}
\column{2in}
\includegraphics[scale=.18]{images/NN.png}


\column{3in}
\begin{itemize}
\item Construct linear combinations:\\
$a_j=\sum\limits_i w_{ji} x_i+w_{j0}$
\item Each activation ($a_j$) is transformed with a non-linear function:
\begin{center} $z_j=h(a_j)$ \end{center}
\item The outputs of this function are called hidden units.
\item $h(\cdot)$ is usually a sigmoidal function.
\item Then you again construct a linear combination, this time of the hidden units, $a_k=\sum\limits_j w_{kj} z_j+w_{k0}$, and again apply the activation function. 
\end{itemize}
\end{columns}
}
\only<3>
{
\raggedright A real-world example:
\includegraphics[scale=.18]{images/seminar_MLP.png}


}


\end{frame}
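\begin{frame}[fragile]\frametitle{Neural Networks - code sketch}
The two linear combinations and the activations from the previous slide as a small forward pass (the layer sizes, random weights and sigmoid choice are illustrative assumptions):
\begin{verbatim}
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def forward(x, W1, b1, W2, b2):
    a = W1 @ x + b1        # a_j = sum_i w_ji x_i + w_j0
    z = sigmoid(a)         # z_j = h(a_j), hidden units
    a2 = W2 @ z + b2       # a_k = sum_j w_kj z_j + w_k0
    return sigmoid(a2)     # output activation

rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(5, 2)), np.zeros(5)  # 2 inputs, 5 hidden
W2, b2 = rng.normal(size=(1, 5)), np.zeros(1)  # 1 output
y = forward(np.array([0.3, -1.2]), W1, b1, W2, b2)
\end{verbatim}
\end{frame}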

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%5

\begin{frame}\frametitle{Decision trees}
\only<1>{
\begin{columns}
\column{2.8in}
{~}\includegraphics[scale=.21]{images/dtree.png}

\column{2.1in}
\begin{itemize}
\item Flow chart (the first trees were computed by hand).
\item Decisions depend on the previous step.
\item Easy to use.
\item Learning converges fast.
\item Usually one trains $\sim$1000 trees per classifier.
\end{itemize}


\end{columns}

}
\only<2>{
\begin{columns}
\column{2.3in}
{~}\includegraphics[scale=.18]{images/bdt3.png}\\
{~}\includegraphics[scale=.18]{images/bdt2.png}

\column{2.5in}
\begin{itemize}
\item A real example used in the LHCb experiment.
\item Search for $\tau \to 3\mu$.
\item Trees are combined using a likelihood. 
\item This is the most commonly used method in HEP.
\end{itemize}


\end{columns}

}

\end{frame}
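\begin{frame}[fragile]\frametitle{Decision trees - code sketch}
Not the actual LHCb analysis code; as a rough illustration, a boosted ensemble of $\sim$1000 trees could be sketched with scikit-learn (the package, the parameters and the placeholders X, y, X\_test are assumptions):
\begin{verbatim}
from sklearn.ensemble import GradientBoostingClassifier

# X, y: training features and labels (signal vs background)
bdt = GradientBoostingClassifier(n_estimators=1000,
                                 max_depth=3,
                                 learning_rate=0.1)
bdt.fit(X, y)
scores = bdt.decision_function(X_test)  # classifier response
\end{verbatim}
\end{frame}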
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{New methods}
\begin{frame}\frametitle{Folding}
\begin{itemize}

\item When training one needs to use two samples: training and testing.
\item The training sample can't be used for the analysis because of biases.
\item Normally one needs to throw some part of the data away just for training.
\item When one considers costs, throwing away $10\%$ of the data is like throwing away 5M dollars a year.
\item Could we get that money/data back?
\end{itemize}
\end{frame}






\begin{frame}\frametitle{Folding}
{~}
\begin{center}

\begin{columns}
\column{2in}
\includegraphics[scale=.14]{images/data2.png}

\column{3in}
1. Reshuffling the events to guarantee the uniformity of the data.
\end{columns}

\begin{columns}
\column{2in}
\includegraphics[scale=.14]{images/data3.png}

\column{3in}
2. Chopping into sub-samples.
\end{columns}

\begin{columns}
\column{2in}
\includegraphics[scale=.14]{images/data4.png}

\column{3in}
3. Training on $n-1$ sub-samples and applying the result to the remaining one (iteratively; sketched in code on the next slide). \\
Increased statistics used in the training (more stable MVA response), no bias in the result :-)
\end{columns}


\end{center}
\end{frame}
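\begin{frame}[fragile]\frametitle{Folding - code sketch}
The three steps above in a few lines of Python (scikit-learn's KFold and the placeholder make\_classifier() are assumptions used only to illustrate the procedure):
\begin{verbatim}
import numpy as np
from sklearn.model_selection import KFold

# 1. reshuffle the events
kf = KFold(n_splits=5, shuffle=True, random_state=0)
response = np.empty(len(X))

# 2./3. train on n-1 sub-samples, apply to the remaining one
for train_idx, apply_idx in kf.split(X):
    clf = make_classifier()          # any MVA, e.g. a BDT
    clf.fit(X[train_idx], y[train_idx])
    response[apply_idx] = clf.decision_function(X[apply_idx])
\end{verbatim}
\end{frame}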

\begin{frame}\frametitle{Folding}
{~}
%\begin{center}
\begin{columns}

\column{2.5in}
\center \includegraphics[scale=.18]{images/FUCK.png}
\column{2.5in}
\begin{itemize}
\item The standard way to judge a classifier is to look at the ROC (Receiver Operating Characteristic) curve.
\item One sees that not only can one use all the data, but one also gains with an increasing number of folds.
\item The explanation is simply statistical: more training data makes the fits inside the classifiers more stable (less sensitive to fluctuations).
\item One can tune the parameters of the classifier to ``higher'' values.
\end{itemize}

\end{columns}

\end{frame}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

\begin{frame}\frametitle{Folding}
{~}
%\begin{center}
\begin{columns}

\column{2.5in}
\center \includegraphics[scale=.23]{images/MN_0p48.png}
\column{2.5in}
\begin{itemize}
\item Example from a recently studied channel: $\PB^0 \to \PKstar \mu \mu$.
\item Using folding, the background was reduced from 500 events to 400. 
\item Backgrounds are extremely dangerous for this analysis.
\end{itemize}

\end{columns}

\end{frame}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

\begin{frame}\frametitle{Ensemble Selection}
{~}
%\begin{center}
\begin{columns}

\column{2.5in}
\center \includegraphics[scale=.26]{images/blend.png}
\column{2.5in}
\begin{itemize}
\item An ensemble is a collection of models whose predictions are combined by weighting or voting.
\item Add to the ensemble the model in the library that maximizes the ensemble's performance. 
\item Repeat the previous step for a fixed number of iterations or until all models are used (see the sketch on the next slide).
\begin{enumerate}
\item In practice one can combine all the classifiers into one single classifier.
\end{enumerate}

\end{itemize}

\end{columns}

\end{frame}
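\begin{frame}[fragile]\frametitle{Ensemble Selection - code sketch}
A greedy selection loop matching the recipe above (library, score, X\_val, y\_val and n\_iterations are placeholders, not a specific implementation):
\begin{verbatim}
ensemble = []
for _ in range(n_iterations):
    # add the library model that maximizes ensemble performance
    best = max(library,
               key=lambda m: score(ensemble + [m], X_val, y_val))
    ensemble.append(best)

# final prediction: average (vote) of the selected models
def predict(x):
    return sum(m.predict(x) for m in ensemble) / len(ensemble)
\end{verbatim}
\end{frame}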


%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\begin{frame}\frametitle{Ensemble Selection}
{~}
%\begin{center}
\begin{columns}

\column{2.5in}
\center \includegraphics[scale=.23]{images/BDT_comparison.png}
\column{2.5in}
\begin{itemize}
\item One clearly gains by using this classifier.
\item This is an extension of Ensemble Selection to the search for $\tau \to 3\mu$.

\item $\tau$ leptons are produced in one of the following modes:
\begin{itemize}
\item $\PB \to \tau X$
\item $\PB \to \PD \to \tau X$
\item $\PB \to \PDs \to \tau X$
\item $\PDs \to \tau X$
\item $\PD \to \tau X$


\end{itemize}
\item One clearly gains using this approach :)
\end{itemize}

\end{columns}

\end{frame}

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Applications}
\begin{frame}\frametitle{Nonscientific application}
{~}

\center \includegraphics[scale=.3]{images/example.png}

\end{frame}

%\section{Applications}
\begin{frame}\frametitle{Nonscientific application}
{~}
\begin{itemize}
\item Revenue prediction for each individual store
\end{itemize}

\center \includegraphics[scale=.33]{images/example2.png}

\end{frame}



\begin{frame}\frametitle{Conclusions}
{~}
\begin{itemize}
\item Machine learning is everywhere.
\item One of the fastest-developing branches of mathematics.
\item Very profitable business :)
\item The market is there, so apart from hard-core mathematics, maybe one should think about putting some time into machine learning for a living?
\end{itemize}



\end{frame}




\end{document}