I'm trying to use the https://github.com/SkAT-VG/SDT toolkit on the Bela.

This is how I compile it, via ssh root@192.168.7.2, on the Bela:

mkdir /root/src
cd /root/src/
git clone https://github.com/SkAT-VG/SDT.git SDT_git

After cloning has completed, the Makefile should be changed according to the following patch:

diff --git a/build/linux/Makefile b/build/linux/Makefile
index 6eb30e0..16332e5 100644
--- a/build/linux/Makefile
+++ b/build/linux/Makefile
@@ -1,7 +1,10 @@
 SHELL=/bin/bash
 
 CC=gcc
-CFLAGS=-fPIC -Wall -Wno-unknown-pragmas -Werror -O3
+# added PD_INCLUDE
+PD_INCLUDE = /usr/local/include/libpd
+# added -march=armv7-a -mtune=cortex-a8 -mfloat-abi=hard -mfpu=neon -ftree-vectorize --fast-math
+CFLAGS=-fPIC -Wall -Wno-unknown-pragmas -Werror -O3 -I"$(PD_INCLUDE)" -march=armv7-a -mtune=cortex-a8 -mfloat-abi=hard -mfpu=neon -ftree-vectorize --fast-math
 LDFLAGS=-shared
 SRCDIR=../../src
 PREFIX=/usr

.... and then compilation can proceed:

cd SDT_git/build/linux
make

# and this to install:
cp -av ../../src/SDT/libSDT.so /usr/lib/
cp -av ../../src/Pd/*.pd_linux /root/Bela/projects/pd-externals/
cp -av ../../Pd/* /root/Bela/projects/pd-externals/

Compilation completes without a problem.

For testing, I create a new PD project on the Bela, and as the _main.pd, I use https://github.com/SkAT-VG/SDT/blob/master/Pd/breaking%7E-help.pd - except I have to explicitly add [SDT] so the other SDT objects can be found/instantiated; and I add some loadbang delays, and a [metro] to keep retriggering sound production, as in:

[image: bela-SDT]

... and copying the output to dac~ channels 3 and 4, in addition to 1 and 2:

[image: bela-SDT2]

When I run this patch on my laptop (Ubuntu 18.04), it runs without any messages about problems.

I had compiled and run this on the "Bela image, v0.3.1, 8 November 2017" image from Bela.io, and if I remember correctly, there were some underruns detected, but not very often.

Recently I performed a full update to "Bela image, v0.3.6b, 23 October 2018" - including both flashing the SD card, and flashing the eMMC (using the Beaglebone eMMC procedure). When I run this patch on the Bela, with Project Settings of Block size (audio frames): 16 (which was the default for me, I guess), then I get a ton of messages like:

Building project...
Build finished
Running project...
Running Pd 0.48-2
Audio channels in use: 2
Analog channels in use: 4
Digital channels in use: 16
No MIDI device enabled
bonk version 1.5
fiddle version 1.1 TEST4
pique 0.1 for PD version 23
sigmund~ version 0.07
=== SDT - Sound Design Toolkit ===
Version 078, (C) 2001 - 2018
Project SOb - http://soundobject.org
Project CLOSED - http://closed.ircam.fr
Project NIW - http://soundobject.org/niw
Project SkAT-VG - http://skatvg.eu
Included externals:
bouncing~ breaking~ bubble~ crumpling~ dcmotor~ demix~ envelope~
explosion~ fluidflow~ friction~ impact~ inertial modal
motor~ myo~ pitch~ pitchshift~ reverb~ rolling~
scraping~ spectralfeats~ windcavity~ windflow~ windkarman~ zerox~
Underrun detected: 1 blocks dropped
Underrun detected: 2 blocks dropped
Underrun detected: 1 blocks dropped
Underrun detected: 1 blocks dropped
Underrun detected: 1 blocks dropped
Underrun detected: 3 blocks dropped
Underrun detected: 2 blocks dropped
Underrun detected: 2 blocks dropped
Underrun detected: 1 blocks dropped
Underrun detected: 1 blocks dropped
...

... and the underrun messages are plentiful and scroll by very fast; occasionally I even get a console message about incoming messages being too fast - and CPU use is around 60%...

While I'm a bit uncertain about block size ( https://forum.bela.io/d/715-block-size-with-puredata-on-the-bela ), if I change the Project Settings Block size (audio frames) to 64 and run again (which does not trigger a rebuild of the executable), then I get strictly "Underrun detected: 1 blocks dropped" (maybe with 2 blocks every once in a while), and CPU use is around 53-55% - so block size apparently makes a difference, but the problem is still there even with the max block size of 128. When I listen to the block size 64 output, however, it doesn't sound any worse than the output from desktop PD.

Maybe the problem is partially due to the external objects not being self-contained, but instead, depending on some generic code in an .so library; for instance, https://github.com/SkAT-VG/SDT/blob/master/src/Pd/breaking%7E.c uses:

t_int *breaking_perform(t_int *w) {
  t_breaking *x = (t_breaking *)(w[1]);
  t_float *out0 = (t_float *)(w[2]);
  t_float *out1 = (t_float *)(w[3]);
  int n = (int)w[4];
  double tmpOuts[2];
  
  while (n--) {
    SDTBreaking_dsp(x->breaking, tmpOuts);
    *out0++ = (t_float)tmpOuts[0];
    *out1++ = (t_float)tmpOuts[1];
  }
  return w + 5;
}

... where SDTBreaking_dsp is defined in https://github.com/SkAT-VG/SDT/blob/master/src/SDT/SDTControl.c (which ends up in the .so, I guess).

As you can see from the Makefile patch, I've tried using Cortex-specific optimizations ( https://forum.bela.io/d/101-compiling-puredata-externals/30 ), but that didn't help much.

Would anyone have an idea why these underruns happen, and what - if anything - can be done to prevent them?

    sdaau After cloning has completed, the Makefile should be changed according to the following patch:

    sometimes it's easier to set those options from the command line when calling make, without needing to patch the file:

    make PD_INCLUDE=/usr/local/include/libpd CFLAGS='-fPIC -Wall -Wno-unknown-pragmas -Werror -O3 -I$(PD_INCLUDE) -march=armv7-a -mtune=cortex-a8 -mfloat-abi=hard -mfpu=neon -ftree-vectorize --fast-math'

    sdaau except I have to explicitly add [SDT] so the other SDT objects can be found/instantiated; and I add some loadbang delays,

    You could avoid that in one of two ways:
    - the hacky way: make sure [SDT] is instantiated before everything else, e.g.: cut everything else in the patch and paste it back.
    - the proper way: use [declare] to pre-load the library.

    sdaau Maybe the problem is partially due to the external objects not being self-contained, but instead, depending on some generic code in an .so library

    Nope, that's not a problem by itself.

    sdaau Would anyone have an idea why these underruns happen, and what - if anything - can be done to prevent them?

    The usual suspect when you get low average CPU usage but repeated underruns is that the CPU load for the audio callback is not constant. Some Pd vanilla objects (e.g.: [fft~], [sigmund~], [fiddle~], possibly [rms~]) have such a property: most of the time they do nothing, they just copy input samples into a buffer. Every so often, when the buffer is full, they run some expensive computation (e.g.: FFT) in the audio thread, which takes longer than the time available to process one block of data. So you get a low average CPU usage, but for each of the spikes you get a dropout. This is often mitigated by increasing Bela's blocksize. Now, by a quick inspection of the code you linked, I don't see such a behaviour in that specific object, but I may be wrong. Also, maybe there is somewhere else in your patch that causes that?
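
    To illustrate the pattern, here is a minimal sketch of such a perform routine (not actual Pd or SDT code, just the shape of the behaviour described above):

    #include "m_pd.h"

    #define ANALYSIS_SIZE 1024

    typedef struct _spiky {
      t_object x_obj;
      t_sample buf[ANALYSIS_SIZE];
      int fill;
    } t_spiky;

    /* stands in for an FFT or similar expensive pass */
    static void expensive_analysis(t_sample *buf, int n) { /* ... */ }

    t_int *spiky_perform(t_int *w) {
      t_spiky *x = (t_spiky *)(w[1]);
      t_sample *in = (t_sample *)(w[2]);
      int n = (int)w[3];

      while (n--) {
        /* cheap work, done on almost every block: just buffer the input */
        x->buf[x->fill++] = *in++;
        if (x->fill == ANALYSIS_SIZE) {
          /* rare, expensive work in the audio thread: this is the CPU
             spike that causes a dropout even though the average is low */
          expensive_analysis(x->buf, ANALYSIS_SIZE);
          x->fill = 0;
        }
      }
      return w + 4;
    }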

      Many thanks for the response, @giuliomoro :

      giuliomoro sometimes it's easier to set those options from the command line when calling make, without needing to patch the file:

      Thanks! I thought that wouldn't work, because the Makefile didn't specify PD_INCLUDE to begin with, but since only CFLAGS uses that variable anyways, this invocation now makes sense to me!

      giuliomoro - the proper way: use [declare] to pre-load the library.

      Great - thanks for this, for some reason I thought [declare] wasn't vanilla PD - I just tried it out, and for the install procedure described above, [declare -lib SDT] works fine on the Bela.

      giuliomoro Nope, that's not a problem by itself.

      Good - thanks for confirming that, nice to get it out of the way.

      giuliomoro So you get a low average CPU usage, but for each of the spikes you get a dropout. This is often mitigated by increasing Bela's blocksize.

      Thanks for this - so knowing this, and along with information from https://forum.bela.io/d/715-block-size-with-puredata-on-the-bela, I did the following:

      First, I added [block~ 128] to a subpatch in the patch, and so the entire patch now looks like this (also, here is TestSDT/_main.pd):

      [image: sdt-pd-block]

      I tried different values of [block~ N] vs. different blocksizes from Bela's Project Settings - and it seems the [block~ N] didn't influence underruns much, but the Project Settings did.

      Then I also recompiled libpd ( as per https://forum.bela.io/d/101-compiling-puredata-externals/47 ), setting DEFDACBLKSIZE first to 64, then experimenting again with [block~ N]/Project Settings, then recompiling libpd with DEFDACBLKSIZE of 128, then experimenting again with [block~ N]/Project Settings - and it seems the biggest influence, regardless, was the Project Settings.

      So in the end I went back to DEFDACBLKSIZE of 16, and made the following script, runblock.sh, which should be placed in, and run from, the same folder where _main.pd resides:

      #!/usr/bin/env bash
      # NB: you need to be inside of a PD project folder on the Bela to run this
      set -x
      BLKSZA="$1"
      BLKSZB="$2"
      sed -i 's/block~ \([0-9]*\)/block~ '$BLKSZA'/' _main.pd
      grep block _main.pd
      # don't use build_project.sh --clean here, it deletes the newly compiled executable!
      # just manually rm instead
      rm -v $(find . -type f -executable)
      ~/Bela/scripts/build_project.sh --force -n $(readlink -f .)/
      # ts program: `apt install moreutils` on a Bela with internet
      ./TestSDT -p $BLKSZB 2>&1 | LC_ALL=C ts '[%Y-%m-%d %H:%M:%S]'

      So, basically, the first argument of the script is the "logical" block size as set by [block~], which is changed directly inside the PD file, and the second argument is the "hardware" block size set by the -p command-line argument when running the final executable. Here are a couple of runs:

      root@bela:~/Bela/projects/TestSDT# bash runblock.sh 512 512
      + BLKSZA=512
      + BLKSZB=512
      + sed -i 's/block~ \([0-9]*\)/block~ 512/' _main.pd
      + grep block _main.pd
      #X obj 104 65 block~ 512;
      + ./TestSDT -p 512
      + LC_ALL=C
      + ts '[%Y-%m-%d %H:%M:%S]'
      [2018-11-23 15:29:20] Underrun detected: 1 blocks dropped
      [2018-11-23 15:29:22] Underrun detected: 1 blocks dropped
      [2018-11-23 15:29:31] Underrun detected: 1 blocks dropped
      [2018-11-23 15:29:41] Underrun detected: 1 blocks dropped
      [2018-11-23 15:29:53] Underrun detected: 1 blocks dropped
      [2018-11-23 15:29:54] Underrun detected: 1 blocks dropped
      [2018-11-23 15:29:58] Underrun detected: 1 blocks dropped
      
      root@bela:~/Bela/projects/TestSDT# bash runblock.sh 128 512
      + BLKSZA=128
      + BLKSZB=512
      + sed -i 's/block~ \([0-9]*\)/block~ 128/' _main.pd
      + grep block _main.pd
      #X obj 104 65 block~ 128;
      + LC_ALL=C
      + ts '[%Y-%m-%d %H:%M:%S]'
      + ./TestSDT -p 512
      [2018-11-23 15:30:21] Underrun detected: 1 blocks dropped
      [2018-11-23 15:30:32] Underrun detected: 1 blocks dropped
      [2018-11-23 15:30:42] Underrun detected: 1 blocks dropped
      [2018-11-23 15:30:54] Underrun detected: 1 blocks dropped
      [2018-11-23 15:30:55] Underrun detected: 1 blocks dropped
      [2018-11-23 15:30:59] Underrun detected: 1 blocks dropped
      
      root@bela:~/Bela/projects/TestSDT# bash runblock.sh 16 512
      + BLKSZA=16
      + BLKSZB=512
      + sed -i 's/block~ \([0-9]*\)/block~ 16/' _main.pd
      + grep block _main.pd
      #X obj 104 65 block~ 16;
      + LC_ALL=C
      + ts '[%Y-%m-%d %H:%M:%S]'
      + ./TestSDT -p 512
      [2018-11-23 15:32:13] Underrun detected: 1 blocks dropped
      [2018-11-23 15:32:15] Underrun detected: 1 blocks dropped
      [2018-11-23 15:32:24] Underrun detected: 1 blocks dropped
      [2018-11-23 15:32:34] Underrun detected: 1 blocks dropped

      So - if I increase the hardware block size via -p to 512, the underruns decrease significantly in frequency, but they still occur - sometimes a second apart, sometimes 10 seconds apart - and I noticed they are almost always in sync with the [metro]ed [bang] from the patch; changing the logical block size here does not seem to have much effect.

      Changing the hardware block size, via -p to 1024, decreases the frequency of underruns even further - however, this causes a high-pitched noise to appear on the output, so I can't really use it. Curiously, this high-pitched noise at -p 1024 (vs -p 512) appeared regardless of which DEFDACBLKSIZE libpd build I used.

      giuliomoro The usual suspect when you get low average CPU usage but repeated underruns is that the CPU load for the audio callback is not constant. Some Pd vanilla objects (e.g.: [fft~], [sigmund~], [fiddle~], possibly [rms~]) have such a property: most of the time they do nothing, they just copy input samples into a buffer. Every so often, when the buffer is full, they run some expensive computation (e.g.: FFT) in the audio thread, which takes longer than the time available to process one block of data. So you get a low average CPU usage, but for each of the spikes you get a dropout.

      giuliomoro Now, by a quick inspection of the code you linked, I don't see such a behaviour in that specific object, but I may be wrong. Also, maybe there is somewhere else in your patch that causes that?

      Yeah, that could be it. When I do grep '\~' TestSDT/_main.pd, I can see that besides the vanilla DSP objects *~, dac~ and block~, the only other DSP objects in use are impact~ and breaking~. We saw breaking~ earlier, and https://github.com/SkAT-VG/SDT/blob/master/src/Pd/impact~.c has this DSP function:

        t_int *impact_perform(t_int *w) {
          t_impact *x = (t_impact *)(w[1]);
          t_float *in0 = (t_float *)(w[2]);
          t_float *in1 = (t_float *)(w[3]);
          t_float *in2 = (t_float *)(w[4]);
          t_float *in3 = (t_float *)(w[5]);
          t_float *in4 = (t_float *)(w[6]);
          t_float *in5 = (t_float *)(w[7]);
          int n = (int)w[8];
          double tmpOuts[2 * SDT_MAX_PICKUPS];
          int i, k;

          for (k = 0; k < n; k++) {
            SDTInteractor_dsp(x->impact, *in0++, *in1++, *in2++, *in3++, *in4++, *in5++, tmpOuts);
            for (i = 0; i < x->nOuts; i++) {
              x->outBuffers[i][k] = (t_float)tmpOuts[i];
            }
          }
          return w + 9;
        }

      ... which looks short, i.e. fast - but then SDTInteractor_dsp is defined in https://github.com/SkAT-VG/SDT/blob/master/src/SDT/SDTInteractors.c, and it looks a lot more complicated.

      Now, even though breaking~ goes into impact~ mostly through a DSP connection, the fact that I see the underruns in sync with the triggering bang tells me that the triggering bang possibly "introduces energy into the system", thus causing some of the if'd portions of SDTInteractor_dsp, like this one:

        if (v0 && x->obj0) {
          p = x->obj1 ? SDTResonator_getPosition(x->obj1, x->contact1) : 0.0;
          SDTResonator_setPosition(x->obj0, x->contact0, p);
          SDTResonator_setVelocity(x->obj0, x->contact0, v0);
        }

      ... might now run (as opposed to not running most of the time, when the system is not "bang"ed); and since SDTInteractor_dsp is itself called from a for loop in impact_perform, this might put an additional burden on the CPU - enough for an underrun to happen occasionally. But if that is the case, then I don't see an obvious way to solve it - apart from, well, faster math (tried in the optimizations already done), and a faster CPU. Does this make sense?

      Is there anything else I could possibly try?

        sdaau First, I added [block~ 128] to a subpatch in the patch, and so the entire patch now looks like this (also, here is TestSDT/_main.pd):

        that's NOT how you are supposed to use [block~]: it only affects the subpatch it is contained in (and all its subpatches), so that one is currently doing nothing.

        sdaau + ./TestSDT -p 512

        sdaau Changing the hardware block size, via -p to 1024, decreases the frequency of underruns even further - however, this causes a high-pitched noise to appear on the output, so I can't really use it

        oops, forgot to mention that the maximum block size at which inputs and outputs all work fine is 128.

        sdaau Then I also recompiled libpd ( as per https://forum.bela.io/d/101-compiling-puredata-externals/47 ), setting DEFDACBLKSIZE first to 64, then experimenting again with [block~ N]/Project Settings, then recompiling libpd with DEFDACBLKSIZE of 128, then experimenting again with [block~ N]/Project Settings - and it seems the biggest influence, regardless, was the Project Settings.

        I remember there was a small performance gain going up from 8 to 16 (possibly about 2% CPU was saved); I would expect a similar CPU saving going up to 64, but it's definitely not a game changer.

        It seems that something triggered by the metro is causing this underrun. However, none of those functions looks particularly expensive, so one would need to look at them in more detail. Once the issue is identified, depending on what it is, there may be different solutions.

        You could try to debug things in Pd by trying different configurations. Failing that, inspect the C code in detail. Failing that, you could try to add some benchmarking to the C code: use clock_gettime() at the beginning and end of each callback, or of each of the functions it calls, and see if you can narrow down the problem.
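
        For instance, something along these lines around one of the perform routines (just a sketch: the worst-case tracking and the reporting via post() are arbitrary choices of mine, and post() itself is not real-time safe, which is why it only prints when a new worst case is found):

        #include <time.h>
        #include "m_pd.h"

        static double worst_us = 0;

        t_int *breaking_perform(t_int *w) {
          struct timespec t0, t1;
          double us;

          clock_gettime(CLOCK_MONOTONIC, &t0);

          /* ... the original while (n--) loop of the perform routine
             stays here, unchanged ... */

          clock_gettime(CLOCK_MONOTONIC, &t1);
          us = (t1.tv_sec - t0.tv_sec) * 1e6 + (t1.tv_nsec - t0.tv_nsec) * 1e-3;
          if (us > worst_us) {
            worst_us = us;
            post("breaking~: worst block time so far: %.1f us", us);
          }
          return w + 5;
        }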

          Thanks again, @giuliomoro :

          giuliomoro that's NOT how you are supposed to use [block~]: it only affects the subpatch it is contained in (and all its subpatches), so that one is currently doing nothing.

          I see, thanks - so I guess, I should move all processing into a subpatch, which also contains [block~], and then add two [outlet~]s, so as to connect to [dac~] in the main patch (so the [block~] and [dac~] don't conflict, as per help file)?

          giuliomoro oops, forgot to mention that the maximum block size at which inputs and outputs all work fine is 128.

          Got it - this example I need just for outputs, so it's possible that's why I didn't notice anything wrong by going to 512.

          giuliomoro It seems that something triggered by the metro is causing this underrun. However, none of those functions looks particularly expensive, so one would need to look at them in more detail.

          Yeah, I had a similar hunch about the functions too, but wasn't really sure... Guess more debugging is needed here - if I get a bit further with this, I'll post here...

            sdaau I see, thanks - so I guess, I should move all processing into a subpatch, which also contains [block~], and then add two [outlet~]s, so as to connect to [dac~] in the main patch (so the [block~] and [dac~] don't conflict, as per help file)?

            yes. again, that is only if you want to change the logical blocksize (e.g.: to change the size of an fft), but this is not going to improve CPU performance. if anything, it will make the per-block CPU load more uneven, and therefore more likely to cause dropouts with a low average CPU usage.

            Ultimately, the bottleneck seems to be in SDTInteractor_dsp(). This is called once per sample, and the inlets to the [impact~] object are passed as arguments to the function. When any of the inputs is non-zero, it does something special.
            It seems fairly complicated to find out where exactly the excessive CPU usage is. Possibly it's best to use a profiler to see exactly what is happening. I would do that without using libpd or the Bela environment at all: just write a stand-alone program that links to libSDT, initializes the structures, sends the appropriate setting values, and ultimately runs SDTInteractor_dsp() while faking the input values. You could use gprof for this (after compiling with the appropriate flags), or there may be something better around these days.
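
            Something like this skeleton could be a starting point; here harness.c and process_one_sample() are just placeholders of mine - the actual SDT object construction and configuration would have to come from the SDT headers, with the SDTInteractor_dsp() call going inside the placeholder. Also, gprof only accounts for code compiled with -pg, so you may need to compile the SDT sources directly into the harness (or rebuild libSDT with -pg) to see inside SDTInteractor_dsp().

            /* build & profile, for example:
             *   gcc -pg -O3 harness.c -lSDT -o harness
             *   ./harness            # writes gmon.out
             *   gprof ./harness gmon.out | less
             */
            #include <stdio.h>

            /* placeholder for the per-sample libSDT call: in a real harness the
               SDT objects would be created and configured in main() (as per the
               SDT headers) and SDTInteractor_dsp() would be called from here */
            static double process_one_sample(double strike_velocity) {
              return strike_velocity * 0.5;
            }

            int main(void) {
              double acc = 0.0;
              long i;
              /* fake ~10 s of input at 44100 Hz: silence, plus a non-zero
                 "strike" once per second, to mimic the [metro]ed bang */
              for (i = 0; i < 10L * 44100; i++) {
                double v = (i % 44100 == 0) ? 1.0 : 0.0;
                acc += process_one_sample(v);
              }
              printf("%f\n", acc); /* keep the loop from being optimized away */
              return 0;
            }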