\section{Performance studies}

\subsection{Hardware specification}
\label{appendix:hardware specs}

The measurements are performed either on CPUs only or additionally on a GPU. 
The following hardware and software stack was used for Fig. \ref{fig:gperf}.

\begin{description}
	\item[CPU] Intel i7-8850H with 6 physical cores at 2.60 GHz, presenting 12 
	logical cores through hyper-threading. The available shared memory is 32 GB 
	of RAM, which was never filled to more than half. While hyper-threading can 
	be very useful for applications whose bottleneck is not the actual 
	computation on the CPU, this is often not the case in HPC. As the 
	experiments have shown, there is only a minor difference between using 6 or 
	12 cores. Therefore, only the 6 physical cores are used in order to 
	quantify the speedup correctly, as sketched after this list.
	
	\item[GPU] Mobile Nvidia P1000 with 4 GB of RAM. It contains the same 
	processing unit as the consumer GTX 1050 series but is intended for 
	professional usage and performs float64 computations more efficiently.
\end{description}
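
Restricting TF to the physical cores can be done through the session 
configuration. The following is a minimal sketch for TF 1.x; the thread-pool 
settings shown here are an assumption and do not reproduce the actual 
measurement scripts.

\begin{verbatim}
import tensorflow as tf

# Cap both TF 1.x thread pools (within a single op and across
# independent ops) at the 6 physical cores.
config = tf.ConfigProto(intra_op_parallelism_threads=6,
                        inter_op_parallelism_threads=6)
sess = tf.Session(config=config)
\end{verbatim}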

Notably, the GPU and the CPU are in the same price range, which allows for a 
meaningful comparison between them.

For Fig. \ref{fig:time events}, a cluster server with varying hardware was 
used. Eight cores were requested for the studies, though the workload of other 
jobs and the CPU type may have an impact on the results.

The tests were performed with TensorFlow version 1.13, pre-built by Anaconda 
using the MKL library. A version built against the Eigen library is also 
available; tests revealed differing performance for different tasks, up to 
around a factor of two in time. For the GPU version, CUDA 10.0 with cuDNN was 
used.


\subsection{Profiling TensorFlow}
\label{appendix:profiling tensorflow}
Code consists of parallelised and serial parts. While the speed of the 
former scales\footnote{Ideally. In reality, cores contend for shared 
resources such as memory bandwidth, so the scaling is not perfectly linear.} 
with the number of cores, the latter does not. The total execution time $t$ 
is given by
\begin{equation}
t = \sum_{i=1}^{n_s} t_s^{(i)} + 
\sum_{i=1}^{n_p} 
\left( t_p^{(i)} / n_{cpu} + t_o^{(i)} \right),
\end{equation}
where $n_s$ and $n_p$ are the numbers of \textit{serial} and \textit{parallel} 
parts, respectively. $t_s^{(i)}$ refers to the execution time of the $i$th 
serial part, $t_p^{(i)}$ to that of the $i$th parallel part \textit{if 
executed serially}, and $t_o^{(i)}$ denotes the overhead that is needed for 
each parallel execution.
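
As a purely illustrative evaluation of this timing model, with invented 
numbers that do not correspond to any measurement:

\begin{verbatim}
n_cpu = 6
t_s = [0.5, 0.3]  # serial parts (s)
t_p = [6.0, 3.0]  # parallel parts, time if executed serially (s)
t_o = [0.1, 0.1]  # per-part parallelisation overhead (s)

# serial parts plus each parallel part split over the cores,
# with its constant overhead added
t = sum(t_s) + sum(tp / n_cpu + to for tp, to in zip(t_p, t_o))
print(t)  # 0.8 + 1.1 + 0.6 = 2.5 s
\end{verbatim}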

The serial part consists of
\begin{itemize}
	\item Reading in data from disk.
	\item Setup code such as building a model in \zfit{}.
	\item Global operations such as reductions over all values. For example, 
	determining whether a stopping criterion, such as the sum of all 
	gradients, has fallen below a threshold is a serial operation.
\end{itemize}
while the parallel time $t_p$ usually contains the heavy computations: 
evaluating a function on data, where the data can be split amongst the cores. 
The overhead for the parallel execution time includes
\begin{itemize}
	\item the overhead of creating a new thread for the parallel execution.
	\item the time to move data between the CPUs or even to the GPU.
\end{itemize}
In addition, the serial time $t_s$ includes bottlenecks in I/O or in moving 
data.

In order to achieve maximum performance and minimise $t$,
\begin{itemize}
	\item there should be as little serial execution time as possible. This 
	is, however, heavily limited by the program logic, and a \textit{certain} 
	amount will always remain.
	\item the code should be split into serial and parallel parts as rarely 
	as possible, since each split adds a constant $t_o$ term.
\end{itemize}

These two points often conflict heavily and lead to a simple inequality 
describing when to parallelise,
\begin{equation*}
	t_p^{(i)} - t_o^{(i)} > t_p^{(i)} / n_{cpu},
\end{equation*}
which reveals that even for large $n_{cpu}$, the overhead can be the decisive 
term. Furthermore, whether it is suitable to execute a piece of code serially 
or in parallel depends on $n_{cpu}$. Together with the difficulty of 
predicting the overhead time, this makes even the decision of whether a 
perfectly parallelisable piece of code should actually be run in parallel a 
heuristic problem. To find the optimal parallelisation, TF runs a small 
simulation of the graph, thereby determining the overhead and the optimal 
number of cores.
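
The decision rule itself is easily illustrated with invented numbers (the 
cost model TF uses internally is more involved and not reproduced here):

\begin{verbatim}
def should_parallelise(t_p, t_o, n_cpu):
    # parallelise if executing the part serially takes longer than
    # splitting it over the cores plus the constant overhead
    return t_p - t_o > t_p / n_cpu

print(should_parallelise(t_p=1.0, t_o=0.2, n_cpu=6))   # True
print(should_parallelise(t_p=0.01, t_o=0.2, n_cpu=6))  # False: overhead dominates
\end{verbatim}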

Since TF actually executes the computations, any execution time measurement 
largely reflects the performance of TF on this task. As TF itself is under 
active development, the performance is in general expected to improve in the 
future.

To get a reasonable estimate of what TF is capable of and to somewhat avoid 
potential bottlenecks from \zfit{}, a dummy test function similar to a loss 
was written in pure TF. The function creates three times one million random 
numbers and performs a few operations on them before reducing them to a 
single number. In a loop, this is repeated 100 times, each result being added 
to the previous one. There are no I/O bottlenecks and, while not an optimal 
example for TF, it reasonably resembles what can be expected in model fitting.
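A minimal sketch of such a dummy loss in TF 1.x is shown below. The exact 
elementwise operations of the benchmark are not specified in the text; the 
ones used here are placeholders with the same structure (random numbers, a 
few operations, a reduction, repeated and accumulated 100 times).

\begin{verbatim}
import tensorflow as tf  # TF 1.x, graph mode

def dummy_loss(n=1000000, n_loops=100):
    total = tf.constant(0., dtype=tf.float64)
    for _ in range(n_loops):
        # three times one million random numbers
        a = tf.random_uniform([n], dtype=tf.float64)
        b = tf.random_uniform([n], dtype=tf.float64)
        c = tf.random_uniform([n], dtype=tf.float64)
        # a few operations, reduced to a single number and
        # added to the previous calculation
        total += tf.reduce_sum(tf.exp(-a) * tf.square(b) + c)
    return total

with tf.Session() as sess:
    sess.run(dummy_loss())
\end{verbatim}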
\begin{table}[tbp]
	\begin{center}
		\begin{tabularx}{0.7\textwidth}{ X | X | X | X }
			Problem size & 1 CPU    & 6 CPUs    & GPU       \\ \hline
			1 x problem  & 1.0 sec  & 0.27 sec  & 0.093 sec \\ \hline
			12 x problem & 13.4 sec & 3.3 sec   & 1.0 sec   \\ \hline
		\end{tabularx}
	\end{center}
	\caption{Execution time measurements of a loss-like function. The 
		complexity of the problem is scaled by adding the same loss $n$ times 
		to the reduce function.}
\end{table}

We can see that the speedup per core is roughly a factor of 2/3 of the ideal 
case of 1. For example, going from 1 CPU to 6 CPUs, the execution time would 
ideally be expected to decrease by a factor of 1/6 but decreases only by a 
factor of about 1/4. While there are ways of writing more efficient code, the 
example was chosen to reflect an arbitrary, non-optimised implementation as 
expected to be found in \zfit{}, mostly with custom models.
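
The quoted factors follow directly from the table values:

\begin{verbatim}
t_1, t_6 = 1.0, 0.27      # measured times for 1 and 6 CPUs (s)
speedup = t_1 / t_6       # ~3.7 instead of the ideal 6
efficiency = speedup / 6  # ~0.62, i.e. roughly 2/3 per core
\end{verbatim}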

\subsection{Additional profiling}
\label{appendix:additional profiling}

Additional performance studies, not shown in Sec. \ref{sec:performance}, are 
displayed here.

\begin{figure}[tbp]

	\includegraphics[width=\textwidth]{figs/gperf/loglog_9gauss_across_sampling_2param.png}
	\caption{Full toy study with a sum of 9 Gaussians and 2 free parameters. 
	We can see that \zfit{}'s temporary bottleneck in sampling causes an 
	extraordinary increase in execution time, mostly for a low number of 
	events, but the conclusions and the overall scaling behaviour are still 
	the same as described in Sec. \ref{sec:perf gaussian models}.}
	\label{fig:gperf across sampling}
\end{figure}