I run scsynth with

scsynth -t 57110 -l 64 -z 64 -Z 64 -J 8 -K 8 -G 16 -i 2 -o 2

so far everything works well, except that when I create a "not so much but considerable" load (equivalent of 3% on my Mac), I get a fair amount of "Underrun detected: 1 blocks dropped" messages. Rarely 2 blocks dropped, never more than 2.
I tried to increase the block size and hardware buffer size with -z and -Z, but this does not help at all. I wonder why.

If I am really not able to get rid of this, doesn't the posting itself slow down the real-time processing?
Is there a way to suppress the posting at the source, where it happens (I already found 2> /dev/null)?

Turning off that notification is possible in a Bela program, but not from SC. I guess we should finish working on this to make it possible: https://github.com/sensestage/supercollider/issues/21 . Until then, the only option is to re-compile SuperCollider with the following patch:

diff --git a/server/scsynth/SC_Bela.cpp b/server/scsynth/SC_Bela.cpp
index 8e3430572..aa66194be 100644
--- a/server/scsynth/SC_Bela.cpp
+++ b/server/scsynth/SC_Bela.cpp
@@ -335,6 +335,7 @@ bool SC_BelaDriver::DriverSetup(int* outNumSamples, double* outSampleRate)
 {
     BelaInitSettings settings;
     Bela_defaultSettings(&settings);
+    settings.detectUnderruns = 0;
     settings.setup = sc_belaSetup;
     settings.render = sc_belaRender;
     settings.interleave = 0;
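
For comparison, here is roughly what that same switch looks like in a standalone Bela program with a custom main(). This is only a sketch, not SC code: setup()/render()/cleanup() are assumed to be defined in the project as usual, and the only line that really matters is settings.detectUnderruns = 0, i.e. the same field that the patch above sets in SC_BelaDriver::DriverSetup():

#include <Bela.h>
#include <csignal>
#include <unistd.h>

// Assumed to be defined elsewhere in the project, as in any Bela program.
bool setup(BelaContext* context, void* userData);
void render(BelaContext* context, void* userData);
void cleanup(BelaContext* context, void* userData);

static volatile bool gShouldStop = false;
static void interrupt_handler(int) { gShouldStop = true; }

int main()
{
    BelaInitSettings settings;
    Bela_defaultSettings(&settings);
    settings.detectUnderruns = 0; // do not detect/report "Underrun detected" messages
    settings.setup = setup;
    settings.render = render;
    settings.cleanup = cleanup;

    if (Bela_initAudio(&settings, 0) != 0)
        return 1;
    if (Bela_startAudio() != 0)
        return 1;

    signal(SIGINT, interrupt_handler);
    signal(SIGTERM, interrupt_handler);
    while (!gShouldStop)
        usleep(100000);

    Bela_stopAudio();
    Bela_cleanupAudio();
    return 0;
}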

Here's the synth I'm running (from an external sclang):

On Bela:

scsynth -t 57110 -l 64 -z 16 -J 8 -K 8 -G 16 -i 2 -o 2

or

scsynth -t 57110 -l 64 -z 32 -Z 32 -J 8 -K 8 -G 16 -i 2 -o 2

In sclang (on my main machine):

(
g = Server(\bela, NetAddr("192.168.7.2", 57110));
g.startAliveThread(0);
g.doWhenBooted({
	g.notify;
	g.initTree
});
)

(
x = {
	var freq = \freq.kr(50, 0.1);
	var tilt = \tilt.kr(0.5);
	var invTilt = (1-tilt) * 2;
	tilt = tilt * 2;
	(0..15).inject(Pulse.ar(LFNoise1.ar(LFNoise1.kr(0.1).range(0.1, 2)).exprange(freq, freq + 1), 0.2) * LFNoise1.ar(Rand(0.01, 0.1)), { |in|
			var amp = Amplitude.ar(in, 0.1, 1);
			Mix([tilt * in, invTilt * [RHPF, RLPF, BPF].choose.ar(DelayL.ar(in, 0.1, 1-amp.lag(Rand(0.1, 1)) * Rand(0.005, 0.1)).tanh.neg, amp * ExpRand(2000, 4000) + 100, Rand(0.8, 1))]);
		}).tanh
}.play(g)
)

I had a quick look at nova_simd and it does not seem to provide a NEON implementation of tanh. I then looked over its generic C++ implementation. I have no idea what it does, but it looks much less efficient than the C version and the NEON version used by math_neon: https://github.com/giuliomoro/math-neon/blob/master/math_tanhf.c.

Note that the NEON implementation in math_tanhf.c is three times faster than the C implementation in the same file, and 3.5 times faster than the libmath implementation.
So perhaps it is worth trying to port the NEON implementation to nova-simd.
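
To make the idea concrete, here is a rough sketch of what a NEON tanh kernel over a block of samples could look like. To be clear, this is not a port of the math_neon routine and these are not nova-simd's actual function names; the clamped Padé approximation x*(27 + x^2) / (27 + 9*x^2) is just a placeholder to show the structure (load 4 floats, approximate, store 4 floats):

#include <arm_neon.h>

// Clamp to [-3, 3], then apply the Padé approximation x*(27 + x^2) / (27 + 9*x^2).
// Placeholder approximation only; math_neon uses a different (more accurate) one.
static inline float32x4_t tanh4_approx(float32x4_t x)
{
    const float32x4_t three       = vdupq_n_f32(3.0f);
    const float32x4_t twentyseven = vdupq_n_f32(27.0f);
    const float32x4_t nine        = vdupq_n_f32(9.0f);

    x = vminq_f32(vmaxq_f32(x, vnegq_f32(three)), three);
    float32x4_t x2  = vmulq_f32(x, x);
    float32x4_t num = vmulq_f32(x, vaddq_f32(twentyseven, x2));
    float32x4_t den = vaddq_f32(twentyseven, vmulq_f32(nine, x2));

    // ARMv7 NEON has no full-precision vector divide: refine a reciprocal
    // estimate with two Newton-Raphson steps instead.
    float32x4_t r = vrecpeq_f32(den);
    r = vmulq_f32(r, vrecpsq_f32(den, r));
    r = vmulq_f32(r, vrecpsq_f32(den, r));
    return vmulq_f32(num, r);
}

// Process a whole block, 4 samples at a time. n is assumed to be a multiple
// of 4, which holds for the usual power-of-two audio block sizes.
void tanh_block_neon(float* out, const float* in, unsigned int n)
{
    for (unsigned int i = 0; i < n; i += 4)
        vst1q_f32(out + i, tanh4_approx(vld1q_f32(in + i)));
}

Whatever the actual approximation ends up being, the per-block loop is the part that would have to be hooked into nova-simd's existing vectorised math dispatch, and it would need to be benchmarked against both the current C++ implementation and math_neon before drawing conclusions.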

A more in-depth analysis:
- using (0..5) instead of (0..15) gives 60% CPU
- stripping off the final .tanh brings it down to 58%
- additionally stripping off the .tanh in ).tanh.neg brings it down to 48%

Note that there is still a baseline CPU usage of around 18%. This should be improved at some point.