Karel Adámek | ManyCores blog

This post is a follow-up to our article about polyphase filter, which could be found in Astronomy & Computing (link) or on arxiv (link). In the article we have discussed details of our implementation as well as our results. Our code could be found on github (link).

The polyphase filter

The polyphase filter (PPF) is a two step algorithm. As a first step we apply a linear filter, which combines T previous time domain samples, we refer to them as raw samples, into one filtered sample. In the case of the polyphase filter the linear filter combines samples within a single channel of raw spectra and produces value of that channel in filtered spectra. This step is described by a finite impulse response (FIR) filter. The second step is to apply a discrete Fourier transformation (DFT) on a single filtered spectra which produces a single frequency spectra. These two steps are outlined as follows:

raw spectra → FIR → filtered spectra
filtered spectra → DFT → frequency spectra

For further description of data format we assume we refer to our article.

The FIR filter [1] (wiki) applied to whole spectra of (C) channels, that is 2D case, is given by a formula
$$
y[n]=\sum_{t=0}^{T-1}b[n+Ct]x[n-C(T-1-t)],
\label{equ:FIR}
$$
where (0 le n < C) is a channel index within a spectra, and (b) are a coefficients of a response function. Square brackets ([,]) indicate that a physical quantity is discrete (sampled), (x) represents samples from the input data belonging to the raw spectra and the quantity (y) represents samples in the filtered spectra. Quantities (y) and (x) are assumed to be complex.

The discrete Fourier transformation (DFT) forms the second step of the polyphase filter. The discrete Fourier transformation (DFT) of a time-domain signal (y[n]) is given [1] by equation
$$
Y[m]=\sum_{n=0}^{C-1}y[n]\exp{\frac{-i2\pi nm}{C}},
\label{equ:FFT}
$$
where (Y[m]) represents data in frequency domain and (C) represents the number of channels. For computation of the discrete Fourier transformation we are using Fast Fourier transformation (FFT) algorithm provided by NVIDIA in cuFFT library [2].

We have implemented the FIR part of the polyphase filter. This part is not limited to the polyphase filter alone and it could by used as a stand alone code for filtering any signals with any response function in time-domain signal processing.

New results

What we would like to show is performance of our polyphase filter on a new generation of NVIDIA cards. The Pascal generation. At the moment we are presenting results for the GTX 1080 and Pascal TITAN XP. As you can see the new GTX 1080 is in performance comparable to a TITAN X of the Kepler generation.

Results of our PPF for a wide range of taps and channels are shown below. All graphs show a horizontal line which describes the PCIe 3.0 bus bandwidth in terms of sample rate. These lines are for the theoretical value (grey) and 75% of the theoretical value which is an estimate of what could be realistically achieved. Sample rates above these lines indicate that a GPU is able to process all of the data we can transfer to the card per second i.e. data are processed in real time or quicker. Estimated number of samples per one SKA-mid beam is 64MHz (sampling at 64μs and 4096 frequency channels). This also means that there will be room for other signal processing steps as well as the polyphase filter. Some cards do not have results for all channels because they do not have enough memory to compute them. Note that the K80 has two physical GPUs, but our results use only one of these. Please note that we have not optimized PPF code for Pascal architecture and we are using the code for Kepler generation.

First two figures are for 32-bit version of the polyphase filter. The new GTX 1080 card is for small number of taps, where card is limited by global memory bandwidth comparable with TITAN X (Kepler). We can also see that the card is limited by the global memory bandwidth up to 24 taps. This shows abundance of shared memory bandwidth and reserve in number of floating operations card can use because it is able to cope with increasing computational operations and memory accesses demanded by FIR filter with a higher number of taps. For TITAN X this number was only 12 taps. New TITAN XP is almost 60% faster then GTX 1080. Similarly card manifest significant decrease in performance from 24 taps. Performance of both card drops significantly after 80 taps because shared memory code runs out of memory and we do not have results from or cache code, which takes over for very high number of taps as it can be seen with other GPUs.

Performance of our PPF on different NVIDIA cards and Xeon Phi (KNC) for 32bit input data. The GTX 1080 is comparable in performance with TITAN X (Kepler) from previous generation.

Same situation as on the previous figure is repeated here. GTX1080 is in performance almost identical to TITAN X. The drops in sample rate are caused by poor performance of the FFT for number of channels which are not a power of two. New TITAN XP is almost 60% faster than GTX 1080.

For the 16bit polyphase filter the situation is similar. The 16bit kernel works well even if it is not optimized for the Pascal generation. The new TITAN XP is almost 50% faster than GTX 1080.

Performance of our PPF on different NVIDIA cards for 16bit input data with changing number of taps.

Performance of our PPF and as it depends on number of channels. The GTX 1080 is again close to the TITAN X.

Lastly, we present performance of 8bit kernel. Performance increase is similar to previous generations. New TITAN XP is again about 50% faster then GTX 1080.

SampleRate8bit_taps — Performance of our PPF on different NVIDIA cards for 8bit input data with changing number of taps.

SampleRate8bit_channels — Performance of our PPF and as it depends on number of channels. The GTX 1080 is again close to the TITAN X (Kepler).

In the following figure we are showing the comparison of cuFFT. The y-axis is showing the speed-up against the GTX 980 Ti card for a different number of channels. As one can see the Titan pascal card is approximately twice quicker in then the reference card.

Text — Relative comparison of the speed-up of the cuFFT against the GTX 980 Ti (normed to 1) with different cards.

[1] Lyons, R. G., 2011. Understanding digital signal processing 3rd ed. Prentice Hall

ManyCores blog

All posts by Karel Adámek

A polyphase filter

The polyphase filter

New results