Hi,
I'm working on a granular / convolution effect. Because time-domain convolution is expensive, I wanted to use Neon instructions to speed up the rendering. It works fine with the 2x float types, where I basically use SIMD to speed up stereo processing (I do two convolutions in parallel).
However, when I switched to the 4x single-precision Neon types / instructions, I get swamped with mode switches. I wrote a small wrapper around the GCC Neon intrinsics / types, and the 2x float and 4x float versions look identical (apart from the matching functions, obviously), so I'm a bit lost as to why the 2x version works but the 4x version somehow trashes the system.
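To illustrate, here is a minimal sketch of the kind of wrapper I mean. It is a simplified illustration, not the actual code: the function names (conv_sample_x2, conv_sample_x4), the interleaved buffer layout, and the scalar fallback for builds without Neon are all assumptions made for the example. Each function computes one output sample per lane as a multiply-accumulate over the taps.

```c
#include <stddef.h>

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#define HAVE_NEON 1
#else
#define HAVE_NEON 0   /* scalar fallback so the sketch builds anywhere */
#endif

/* Hypothetical 2-lane helper: x and h are interleaved (lane0, lane1, ...),
 * 2 * taps floats each. out[lane] = sum over i of x[2*i+lane] * h[2*i+lane]. */
static void conv_sample_x2(const float *x, const float *h, size_t taps,
                           float out[2])
{
#if HAVE_NEON
    float32x2_t acc = vdup_n_f32(0.0f);            /* 64-bit D register */
    for (size_t i = 0; i < taps; ++i)
        acc = vmla_f32(acc, vld1_f32(x + 2 * i), vld1_f32(h + 2 * i));
    vst1_f32(out, acc);
#else
    out[0] = out[1] = 0.0f;
    for (size_t i = 0; i < taps; ++i)
        for (size_t l = 0; l < 2; ++l)
            out[l] += x[2 * i + l] * h[2 * i + l];
#endif
}

/* Hypothetical 4-lane helper: same structure, but with the q-suffixed
 * intrinsics operating on a 128-bit Q register (4 * taps floats each). */
static void conv_sample_x4(const float *x, const float *h, size_t taps,
                           float out[4])
{
#if HAVE_NEON
    float32x4_t acc = vdupq_n_f32(0.0f);           /* 128-bit Q register */
    for (size_t i = 0; i < taps; ++i)
        acc = vmlaq_f32(acc, vld1q_f32(x + 4 * i), vld1q_f32(h + 4 * i));
    vst1q_f32(out, acc);
#else
    out[0] = out[1] = out[2] = out[3] = 0.0f;
    for (size_t i = 0; i < taps; ++i)
        for (size_t l = 0; l < 4; ++l)
            out[l] += x[4 * i + l] * h[4 * i + l];
#endif
}
```

The only differences between the two are the vector type and the q suffix on the intrinsics, which is why I would expect them to behave the same.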
Is there anything I'm missing here?