I've been working on a partitioned convolution reverb for the last couple of weeks. I found it pretty tricky, mostly because I was learning to code at the same time as visualising a difficult concept. I think it's a decent partitioning scheme now though, and it seems to work :)

Currently up to 2.9 seconds of reverb at around 40% CPU, but it's super easy to extend to more.

Would welcome any feedback on weird bits of code or ideas of better ways to do things!
It's at: https://github.com/resynth/Convolution_Reverb_Bela

that's a very good start!

It would seem that you have a pretty large base latency, as you do an FFT even for the smallest block ... you could instead do time-domain convolution on the smallest block to minimise the latency there. Given that this is meant for a reverb, it may not be a big deal, but if one wanted to do convolution for e.g. cabinet simulation, it'd be better to have minimal latency. You could use the Convolver library.
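As a rough sketch of what that time-domain path for the smallest partition could look like (the function name and buffer layout here are just assumptions for illustration, not the Convolver library's actual API):

```cpp
#include <cstddef>
#include <vector>

// Direct (time-domain) convolution of one input block against a short IR
// segment. `history` holds the previous ir.size() - 1 input samples from
// earlier blocks; assumes in.size() >= history.size(). Names illustrative.
void directConvolveBlock(const std::vector<float>& in,
                         const std::vector<float>& ir,
                         std::vector<float>& history,
                         std::vector<float>& out)
{
	out.assign(in.size(), 0.0f);
	for(std::size_t n = 0; n < in.size(); ++n) {
		for(std::size_t k = 0; k < ir.size(); ++k) {
			// x[n - k] comes either from this block or from the previous one
			float x = (n >= k) ? in[n - k]
			                   : history[history.size() - (k - n)];
			out[n] += ir[k] * x;
		}
	}
	// keep the last ir.size() - 1 input samples for the next block
	for(std::size_t i = 0; i < history.size(); ++i)
		history[i] = in[in.size() - history.size() + i];
}
```

Because there is no FFT, the output for each sample is available as soon as that sample arrives, so this stage adds no block latency beyond the audio callback itself.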

At a quick look, it seems that there is a lot of code repetition and it needs refactoring. Whenever you find yourself using variables with names ending in 1, 2, 3 ... it probably means you should be using a vector instead. I think you could go the full step and take advantage of the (void* arg) parameter in the AuxiliaryTask callbacks to pass distinct arguments to each, and then you could use only one process_fft_background() and one process_fft() function. You will also need to get rid of the static variables in process_fft(): make them global and pass around pointers to them.

It's probably easier to group all the relevant stuff in a struct.
For instance:

struct FftData {
	std::vector<std::vector<float>> fdr;
	std::vector<std::vector<float>> fdi;
	std::vector<float> unwrappedBuffer;
	unsigned int fftSize;
	unsigned int hopSize;
	unsigned int windowSize;
	unsigned int cachedInputBufferPointer;
	unsigned int outputBufferWritePointer;
	unsigned int hopCounter;
	unsigned int outSize;
};

std::vector<FftData> gFftData;
std::vector<AuxiliaryTask> gTasks;
unsigned int gNumPartitions = 4;
unsigned int gFirstTaskPriority = 90;

void process_fft_background(void * arg);

bool setup(BelaContext* context, void*)
{
	// resize before starting so the objects never get moved
	gFftData.resize(gNumPartitions);
	gTasks.resize(gNumPartitions);
	for(unsigned int n = 0; n < gNumPartitions; ++n) {
		unsigned int fftSize = 128 * ((1 + n) * 4); // double check I got these right
		unsigned int hopSize = 64 * ((1 + n) * 4); // double check I got these right
		std::vector<std::vector<float>> fd(5, std::vector<float> (fftSize, 0));
		gFftData[n] = FftData{
			.fdr = fd,
			.fdi = fd,
			.unwrappedBuffer = std::vector<float>(fftSize),
			.fftSize = fftSize,
			.hopSize = hopSize,
			.windowSize = hopSize,
			.cachedInputBufferPointer = 0,
			.outputBufferWritePointer = 0,
			.hopCounter = 0,
			.outSize = hopSize * 2 - 1,
		}; // note: designated initialisers must follow the member declaration order
		std::string taskName = "bela-process-fft" + std::to_string(n);
		void* arg = (void*)&gFftData[n]; // a pointer to the nth element, cast to void*
		gTasks[n] = Bela_createAuxiliaryTask(process_fft_background, gFirstTaskPriority - n, taskName.c_str(), arg);
	}
	return true;
}

void process_fft(std::vector<float> const& inBuffer, std::vector<float>& outBuffer, FftData* fftData)
{
	// here, use:
	// fftData->fftSize, fftData->fdr, fftData->cachedInputBufferPointer, ...
	// instead of the corresponding global variables

}
void process_fft_background(void * arg)
{
	FftData* fftData = (FftData*)arg; // cast the pointer back
	process_fft(gInputBuffer, gOutputBuffer, fftData);
	fftData->outputBufferWritePointer = (fftData->outputBufferWritePointer + fftData->hopSize) % gBufferSize;
}

void render(BelaContext* context, void*)
{
	// here, use gFftData[k]->hopSize, gFftData[k]->hopCounter and
	// gFftTasks[k], gFftData[k]->cachedInputBufferPointer
	// instead of the corresponding 1,2,3 variables
}

Thanks Giulio, the struct thing looks great. I think my biggest problem is managing the variables. I did start to put my floats into multidimensional vectors but found the less verbose names harder to read. Your struct approach looks easier to follow.

I decided the latency doesn't matter so much for a reverb; it's just a few ms more pre-delay and it sounds quite nice to my ears. It seems many IRs have a pre-delay of near silence for the first few ms, so I thought I might add some latency compensation whereby the first partition starts gHopSize*2 into the IR file, or manually trim the IR files beforehand. I don't think having under 11.5 ms of pre-delay is desirable on a reverb, so I would rather trim IR files than give Bela more to do convolving near-zeros. Having said that, I have already used your direct convolution example for violin IRs on my electric violin and could easily carry that over to the reverb.
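If I do go the trimming route, the compensation step could be as simple as scanning for the first sample above a threshold, capped at the latency I want to absorb (the threshold value, the cap, and the function name here are all assumptions):

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Find how many leading near-silent samples of an IR can be skipped,
// capped at maxTrim (e.g. gHopSize * 2) so we never remove more than the
// latency the convolution itself introduces. Threshold is an assumption.
std::size_t findIrStart(const std::vector<float>& ir, float threshold,
                        std::size_t maxTrim)
{
	std::size_t start = 0;
	while(start < ir.size() && start < maxTrim
	      && std::fabs(ir[start]) < threshold)
		++start;
	return start;
}
```

The partitioner would then start reading the IR from ir.begin() + findIrStart(ir, threshold, 2 * hopSize) instead of from the very beginning.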

I'll give the struct a go as, at the moment, the only thing that's stopping me going stereo is the realisation of how many extra variables I'd have to manage in my quickly growing mess!

    resynth I'll give the struct a go as, at the moment, the only thing that's stopping me going stereo is the realisation of how many extra variables I'd have to manage in my quickly growing mess!

    There you go: managing variables like this, you can expand to stereo and easily change the number of partitions.

    Did you do any performance comparison of time-domain vs frequency-domain convolution for smaller block sizes? I have the sensation that at 512 samples it may take less CPU to do it in the time domain than in the frequency domain ...
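As a napkin-math starting point, one can compare rough operation counts. The constants below (5 N log2 N per FFT, one multiply-add per sample pair for the time domain) are textbook assumptions; real cost on Bela depends on the FFT library, NEON use and memory access patterns, so only the orders of magnitude mean anything:

```cpp
#include <cmath>

// Rough multiply-add count for direct convolution of an N-sample block
// against an N-sample partition: one multiply-add per sample pair.
double timeDomainOps(double blockSize)
{
	return blockSize * blockSize;
}

// Rough count for the FFT route: zero-pad to 2N for linear convolution,
// one forward and one inverse FFT at ~5 M log2 M each, plus a complex
// spectral multiply at ~4 ops per bin. All constants are assumptions.
double freqDomainOps(double blockSize)
{
	double fftSize = 2.0 * blockSize;
	double fftCost = 5.0 * fftSize * std::log2(fftSize);
	return 2.0 * fftCost   // forward + inverse FFT
	     + 4.0 * fftSize;  // complex spectral multiply
}
```

For a 512-sample block these counts come out around 2.6e5 for the time domain versus 1.1e5 for the frequency domain, so on paper the FFT still wins at 512; but the constants can easily be off by enough (especially with a NEON-optimised time-domain loop) that only measuring on the board would settle it.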

    Regarding using 4 auxiliary tasks rather than just 1: I was attempting to implement a method described in "Implementing Real-Time Partitioned Convolution Algorithms on Conventional Operating Systems" by Eric Battenberg and Rimas Avižienis. In their tests they found the quickest method was to farm each FDL off to its own thread and assign them different priorities depending on their deadline. The longer FDLs have longer deadlines, so I thought it made sense to use a separate auxiliary task for each, with diminishing priorities.

    Does this approach make any sense with the way the auxiliary tasks work on Bela, or am I chasing a bit of a red herring?

    That seems the right approach yes (and the code I drafted above accounts for having different AuxiliaryTasks). Each thread comes with a bit of overhead, though, so there is probably a point where it could be less CPU intensive to have fewer threads (and so try and lump many FFTs into the same one), but I think you are far from it at this moment.

    Also, it looks like this bit here could be refactored by folding it into a for loop:

        for(int n = 0; n < gFftSize; n++) {
            fdr[0][n] = fdr[1][n] + (gFft.fdr(n)*IRfd0[1][0][0][n] - gFft.fdi(n)*IRfd0[1][0][1][n]);
            fdi[0][n] = fdi[1][n] + (gFft.fdi(n)*IRfd0[1][0][0][n] + gFft.fdr(n)*IRfd0[1][0][1][n]);
        }
        // ******************** 2 ************************
        for(int n = 0; n < gFftSize; n++) {
            fdr[1][n] = fdr[2][n] + (gFft.fdr(n)*IRfd0[2][0][0][n] - gFft.fdi(n)*IRfd0[2][0][1][n]);
            fdi[1][n] = fdi[2][n] + (gFft.fdi(n)*IRfd0[2][0][0][n] + gFft.fdr(n)*IRfd0[2][0][1][n]);
        }
        // ******************** 3 ************************
        for(int n = 0; n < gFftSize; n++) {
            fdr[2][n] = fdr[3][n] + (gFft.fdr(n)*IRfd0[3][0][0][n] - gFft.fdi(n)*IRfd0[3][0][1][n]);
            fdi[2][n] = fdi[3][n] + (gFft.fdi(n)*IRfd0[3][0][0][n] + gFft.fdr(n)*IRfd0[3][0][1][n]);
        }
        // ******************** 4 ************************
        for(int n = 0; n < gFftSize; n++) {
            fdr[3][n] = fdr[4][n] + (gFft.fdr(n)*IRfd0[4][0][0][n] - gFft.fdi(n)*IRfd0[4][0][1][n]);
            fdi[3][n] = fdi[4][n] + (gFft.fdi(n)*IRfd0[4][0][0][n] + gFft.fdr(n)*IRfd0[4][0][1][n]);
        }

    could become (please double check!):

    for(unsigned int k = 0; k < 4; ++k) {
        for(int n = 0; n < gFftSize; n++) {
            fdr[k][n] = fdr[k+1][n] + (gFft.fdr(n)*IRfd0[k+1][0][0][n] - gFft.fdi(n)*IRfd0[k+1][0][1][n]);
            fdi[k][n] = fdi[k+1][n] + (gFft.fdi(n)*IRfd0[k+1][0][0][n] + gFft.fdr(n)*IRfd0[k+1][0][1][n]);
        }
    }
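One quick way to do the double-checking is to run both versions on the same random data and compare the results. Here is a self-contained sketch of that check; the gFft.fdr(n)/gFft.fdi(n) and IRfd0[k][0][0/1][n] accesses are replaced with plain vectors (xr, xi, irR, irI) so it compiles on its own, but the multiply-accumulate structure is the same:

```cpp
#include <cstdlib>
#include <vector>

// Verify that the rolled loop produces exactly the same result as the
// unrolled original (same operations in the same order, so the floats
// should match bit for bit).
bool rolledMatchesUnrolled()
{
	const int gFftSize = 16;
	auto fill = [](std::vector<float>& v) {
		for(auto& s : v)
			s = std::rand() / (float)RAND_MAX - 0.5f;
	};
	std::vector<float> xr(gFftSize), xi(gFftSize);
	fill(xr);
	fill(xi);
	std::vector<std::vector<float>> fdr(5), fdi(5), irR(5), irI(5);
	for(int k = 0; k < 5; ++k) {
		for(auto* v : {&fdr[k], &fdi[k], &irR[k], &irI[k]}) {
			v->resize(gFftSize);
			fill(*v);
		}
	}
	auto fdrU = fdr, fdiU = fdi; // copies fed to the unrolled version

	// unrolled, as in the original four loops
	for(int n = 0; n < gFftSize; n++) {
		fdrU[0][n] = fdrU[1][n] + (xr[n]*irR[1][n] - xi[n]*irI[1][n]);
		fdiU[0][n] = fdiU[1][n] + (xi[n]*irR[1][n] + xr[n]*irI[1][n]);
	}
	for(int n = 0; n < gFftSize; n++) {
		fdrU[1][n] = fdrU[2][n] + (xr[n]*irR[2][n] - xi[n]*irI[2][n]);
		fdiU[1][n] = fdiU[2][n] + (xi[n]*irR[2][n] + xr[n]*irI[2][n]);
	}
	for(int n = 0; n < gFftSize; n++) {
		fdrU[2][n] = fdrU[3][n] + (xr[n]*irR[3][n] - xi[n]*irI[3][n]);
		fdiU[2][n] = fdiU[3][n] + (xi[n]*irR[3][n] + xr[n]*irI[3][n]);
	}
	for(int n = 0; n < gFftSize; n++) {
		fdrU[3][n] = fdrU[4][n] + (xr[n]*irR[4][n] - xi[n]*irI[4][n]);
		fdiU[3][n] = fdiU[4][n] + (xi[n]*irR[4][n] + xr[n]*irI[4][n]);
	}

	// rolled
	for(unsigned int k = 0; k < 4; ++k)
		for(int n = 0; n < gFftSize; n++) {
			fdr[k][n] = fdr[k+1][n] + (xr[n]*irR[k+1][n] - xi[n]*irI[k+1][n]);
			fdi[k][n] = fdi[k+1][n] + (xi[n]*irR[k+1][n] + xr[n]*irI[k+1][n]);
		}

	return fdr == fdrU && fdi == fdiU;
}
```

Because the rolled loop visits k = 0..3 in the same order as the four pasted loops, every read of fdr[k+1]/fdi[k+1] happens before that row is overwritten, so the two forms are equivalent.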

    I wrote it like that because, when I was working on the first FDL, it started as one for loop and I kept adding more. When it came to adding the other FDLs I just copied and pasted the previous one. It occurred to me that I could nest the for statements as you describe, but two of the partitioned convolution papers I read suggest loop unrolling to help with performance. Is that an old-fashioned idea from when computers were a bit slower, or something that only matters when you have several for loops nested? It seems I'm asking the CPU to do so many complex multiplications that surely checking a few more loop conditions can't make that much difference? It would certainly make it more readable/serviceable and shorter!