A reader asked why I created a separate DSP board instead of just adding analogue inputs and outputs to the Raspberry Pi and doing the DSP processing in software on the Raspberry Pi CPU. I never thought about this option, but it might be interesting.
Independent of the processing power of the Raspberry Pi CPU (we will come to this later), one problem is the operating system. The OS does not have realtime capabilities, which means the kernel can block any user process for as long as it wants. However, this is more of a theoretical problem than a practical one. With some buffering, this will most likely work well. But do not expect latencies below 1ms; they will be longer than that!
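To get a feeling for why buffering and sub-millisecond latency do not go together, here is a minimal sketch of the arithmetic. The function names and the buffer size used in the note below are assumptions chosen for illustration; only the 48kHz sample rate comes from this post.

```c
#include <assert.h>

/* Latency added by one buffer of `frames` audio frames, in milliseconds.
   Both parameters are illustrative; no concrete buffer size is given in the post. */
double buffer_latency_ms(unsigned frames, unsigned sample_rate_hz)
{
    assert(sample_rate_hz > 0);
    return 1000.0 * (double)frames / (double)sample_rate_hz;
}

/* Largest buffer (in frames) that still fits into a given latency budget. */
unsigned max_frames_for_latency(double budget_ms, unsigned sample_rate_hz)
{
    assert(budget_ms >= 0.0);
    return (unsigned)(budget_ms * (double)sample_rate_hz / 1000.0);
}
```

At 48kHz, a 1ms budget leaves room for only 48 frames per buffer, while a hypothetical buffer of 256 frames already adds about 5.3ms on its own.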
The CPU runs at 700MHz and can be clocked up to 1GHz, which is far faster than the DSP chip on the HiFiBerry DSP light, which runs at only 50MHz. However, clock rate alone does not say much about performance. DSP chips are designed specifically for the algorithms used in digital signal processing, while a general-purpose CPU is optimized to handle most tasks reasonably well. Let’s have a look at the Raspberry Pi CPU: the SoC is produced by Broadcom and uses an ARM1176JZF-S core. It is based on the ARMv6 architecture, which is quite old (the first CPUs based on this architecture shipped in 2002). One good thing about it is that it features a floating point co-processor, a VFPv2. It even has a DSP instruction set. However, the DSP instructions are of little use for our purposes, because they work only with 16-bit data. It would be possible to emulate 32-bit operations with them, but the performance would be relatively low. Therefore we will look at the floating point unit instead.
The VFPv2 floating point unit
After a look at the technical reference manual, it seems that the floating point unit can process most floating point operations in a single clock cycle. On a Raspberry Pi clocked at 1GHz, this would mean a theoretical floating point performance of 1 GFlops. However, this is a purely theoretical number: data has to be transferred between main memory and the VFP, so the practical performance will be much lower. Still, it looks promising.
Simple test in C
Let’s see what happens if we use a simple C program that runs a floating point multiplication over and over. Our program uses the following multiplication in its inner loop (both variables are floats) and executes it exactly 1,000,000,000 times.
f1[i] *= f2[i];
We use arrays to force the least efficient data access (from memory to the CPU and back to memory), so this is a worst-case scenario. At 1 GFlops, the program should finish in 1 second. Let’s see:
~/fp$ time ./ft
real 1m9.358s
user 1m7.960s
sys 0m0.050s
Oops, that’s almost 70 seconds. That corresponds to a floating point performance of only about 14 MFlops. And yes, we did use the hardware VFP unit; the program was compiled that way.
That’s a lot less than even the simple HiFiBerry DSP light board offers. At a 48kHz sample rate, this leaves only about 300 operations per sample. That’s not much.
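The source of the test program is not shown in this post. The following is only a minimal sketch of what a comparable array benchmark might look like; the function names, the array length and the split into passes are assumptions, not the original code. To use the hardware VFP with GCC, one would typically pass -mfpu=vfp together with a matching -mfloat-abi setting, but the exact flags used for the measurement above are not stated.

```c
#include <assert.h>
#include <stddef.h>
#include <time.h>

/* Multiply f1 element-wise by f2, `passes` times over arrays of length n.
   Total number of multiplications performed: n * passes. */
void multiply_arrays(float *f1, const float *f2, size_t n, unsigned long passes)
{
    for (unsigned long p = 0; p < passes; p++) {
        for (size_t i = 0; i < n; i++) {
            f1[i] *= f2[i];   /* load, multiply, store on every iteration */
        }
    }
}

/* Run the loop and return the CPU time it took, in seconds. */
double time_multiply(float *f1, const float *f2, size_t n, unsigned long passes)
{
    assert(n > 0 && passes > 0);
    clock_t start = clock();
    multiply_arrays(f1, f2, n, passes);
    clock_t end = clock();
    return (double)(end - start) / CLOCKS_PER_SEC;
}
```

Calling time_multiply with, for example, n = 1000 and passes = 1000000 performs the same 1,000,000,000 multiplications as the test above. Filling f2 with 1.0f keeps the values in f1 from overflowing or underflowing during the run.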
Now let’s use some code that does not need memory access all the time. It uses only two variables:
f *= g;
This should require fewer memory accesses.
~/fp$ time ./ft
real 0m44.142s
user 0m43.340s
sys 0m0.010s
It looks better, but the performance is still less than 25 MFlops.
Interestingly, the bare loop around the operation, without any floating point work, already takes about 14 seconds. That means the actual floating point performance is a bit better than the raw timings suggest. In the second case it would be a bit more than 30 MFlops.
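The numbers above can be reproduced with a few lines of arithmetic. This is a small sketch with hypothetical helper names; the inputs are the timings measured in this post, with the loop overhead taken as the approximate 14 seconds mentioned above.

```c
#include <assert.h>

/* Throughput in MFlops for `ops` operations that took `seconds`. */
double mflops(double ops, double seconds)
{
    assert(seconds > 0.0);
    return ops / seconds / 1e6;
}

/* Same, but with the time of the bare loop subtracted first. */
double net_mflops(double ops, double total_seconds, double loop_seconds)
{
    assert(total_seconds > loop_seconds);
    return mflops(ops, total_seconds - loop_seconds);
}

/* Floating point operations available per audio sample. */
double ops_per_sample(double mflops_value, double sample_rate_hz)
{
    assert(sample_rate_hz > 0.0);
    return mflops_value * 1e6 / sample_rate_hz;
}
```

With 1,000,000,000 operations, the raw timings of 69.358 and 44.142 seconds give roughly 14.4 and 22.7 MFlops. Subtracting the approximately 14 seconds of loop overhead raises this to about 18 and 33 MFlops. At 48kHz, 33 MFlops corresponds to just under 700 operations per sample.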
Conclusion: Using plain C code, the Raspberry Pi can only be used for simple DSP operations. However, there might be a chance to improve the performance dramatically by using highly optimized assembler code.
- ARMv6 Reference manual
- ARM1176JZF-S™ Technical Reference Manual
- ARM System Developer’s Guide