Bela Mini: Latency x Blocksize, CPU load, Harmonic Distortion

edwillys · Jan 8, 2019

Hello all,

I've just got the Bela Mini capelet and started running some tests on it, beginning with the audio pass-through project. Below I list the things I observed; hopefully you can clarify them.

1) Regardless of the block frame size I set for the project, the measured latency is always the same (around 1ms). I activated the verbose mode with the "-v" parameter and indeed the right block length seems to be received. Plus, the CPU load decreases when increasing the block length, as expected, which implies that something is indeed happening. Do you have any idea why?

2) As mentioned above, when lowering the block size, the CPU load increases as expected. However, if I set it as low as 2 samples per block, the CPU load goes to ~30%, only for copying input to output! Is this high value really expected?

3) Still on the CPU load topic: changing the number of analog channels also has an impact on the CPU load. This is a bit counter-intuitive to me, as I was expecting it to remain constant, since the product of sampling rate x number of channels remains constant.

4) I've also performed some initial harmonic distortion measurements. While it doesn't seem to perform very well (I can share the plots if need be), what surprised me was that it seems to vary with the block length too. Is it possible that some samples are being cut off or something of the like?

My measurement set up consists of an RME Fireface sound card and some MATLAB scripts.

Best

giuliomoro · Jan 8, 2019

edwillys 1) Regardless of the block frame size I set for the project, the measured latency is always the same (around 1ms). I activated the verbose mode with the "-v" parameter and indeed the right block length seems to be received. Plus, the CPU load decreases when increasing the block length, as expected, which implies that something is indeed happening. Do you have any idea why?

The baseline roundtrip latency is about 41 samples, due to internal filtering of the ADC/DAC (39 samples) plus piping the sample to and from the converter. That is about 900 microseconds at 44.1kHz. Small block sizes (e.g.: 2, 4, 8) will have a limited impact on the overall roundtrip latency, e.g.: 8 samples per block should yield (2x8 + 41)/44100 = 1.29ms, so make sure the resolution of your measurement method is high enough to show sub-millisecond differences.
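As a rough sketch of that arithmetic, the snippet below estimates the roundtrip latency for a few block sizes. It assumes a 44.1kHz sample rate and a buffering cost of twice the block size, which is the assumption that reproduces the 1.29ms figure for 8 samples per block. The function name is hypothetical and not part of the Bela API.

```python
SAMPLE_RATE_HZ = 44100   # assumption: audio sample rate
BASELINE_SAMPLES = 41    # baseline roundtrip latency quoted in the post


def roundtrip_latency_ms(block_size,
                         sample_rate_hz=SAMPLE_RATE_HZ,
                         baseline_samples=BASELINE_SAMPLES):
    """Estimated roundtrip latency in milliseconds for a given block size."""
    # Assumption: buffering adds two blocks on top of the converter baseline.
    total_samples = 2 * block_size + baseline_samples
    return 1000.0 * total_samples / sample_rate_hz


if __name__ == "__main__":
    for block_size in (2, 4, 8, 16, 128):
        latency = roundtrip_latency_ms(block_size)
        print(f"{block_size:4d} samples/block -> {latency:.2f} ms")
```

Under these assumptions, going from 2 to 8 samples per block changes the result by less than 0.3ms, which is why a coarse measurement would show "around 1ms" for all small block sizes.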

edwillys However, if I set it to as low as 2 sample per frame, the CPU load goes to ~30%, only for copying input to output! Is this high value really expected?

Yes. Actually, most of the CPU usage there is in waking up the thread and sending it back to sleep. You will see little to no change in CPU usage if you remove the memory copy. Waking up a thread normally takes between 10us and 20us. A block size of 2 samples means that the thread has to wake up every 45us. Therefore, the time that it takes to wake it up is a significant fraction of the time available to run it, and, accordingly, this is what you see as "CPU usage". Also, as long as the block size is >=4, there is a built-in sleep() to make sure the IDE always remains accessible. This can be disabled by passing --high-performance-mode to the running program, and it should show about 1.5% performance increase. The downside is that the IDE may become unresponsive when running a CPU-heavy program, at which point you'd have to press the button on the cape to stop the program.
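To put numbers on this, here is a minimal sketch that expresses the wake-up time as a fraction of the block period. It assumes a 44.1kHz sample rate and models only the wake-up cost quoted above, so it is not a prediction of the CPU load reported by the IDE. The function names are hypothetical.

```python
def block_period_us(block_size, sample_rate_hz=44100):
    """Time between two consecutive audio callbacks, in microseconds."""
    return 1e6 * block_size / sample_rate_hz


def wakeup_overhead_percent(block_size, wakeup_us, sample_rate_hz=44100):
    """Share of each block period spent just waking up the audio thread."""
    return 100.0 * wakeup_us / block_period_us(block_size, sample_rate_hz)


if __name__ == "__main__":
    for block_size in (2, 4, 8, 16):
        period = block_period_us(block_size)
        low = wakeup_overhead_percent(block_size, 10)   # 10us wake-up
        high = wakeup_overhead_percent(block_size, 20)  # 20us wake-up
        print(f"{block_size:2d} samples: period {period:6.1f}us, "
              f"wake-up overhead {low:.1f}% to {high:.1f}%")
```

For 2 samples per block, the period is about 45us, so a 10us to 20us wake-up accounts for roughly 22% to 44% of it, and the share halves every time the block size doubles.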

edwillys 3) Still on the CPU load topic: changing the number of analog channels also has an impact on the CPU load. This is a bit counter-intuitive to me as I was expecting it to remain constant, since the sampling rate x number of channels factor remain constant.

That should not happen, unless you are running a non-C++ project: Pd, SuperCollider etc. use behind-the-scenes resampling to make the sampling rate of the analog channels the same as the audio. This could have a small (I'd expect < 0.5%) impact on the CPU usage.

edwillys 4) I've also performed some initial harmonic distortion measurements. While it doesn't seem to perform very well (I can share the plots if need be), what surprised me was that it seems to vary with the block length too. is it possible that some samples are being cut off or something of the like?

You should have about -86dB THD+noise on the audio channels, but I'd be surprised if it varied with block length. Can you hear any distorted audio?

edwillys · Jan 11, 2019

Hello giuliomoro

Thanks for your reply.

giuliomoro The baseline roundtrip latency is about 41 samples, due to internal filtering of the ADC/DAC (39 samples) plus piping the sample to and from the converter. That is about 900 microseconds. Small block sizes (e.g.: 2, 4, 8) will have a limited impact on the overall roundtrip latency, e.g.: 8 samples per block should yield 1.29ms, so make sure the resolution of your measurement method is high enough to show sub-millisecond differences.

Correct. Indeed, the value I measured seems to correspond to the 2 samples per block configuration. I did test all of the block sizes though, including 128 samples per block, which should result in a latency of 2x128/44100 + 900us ~= 7ms. This did not show up in the measurement at all. The resolution of the method I'm using is definitely in the sub-millisecond range, as per other measurements I've done in the past. It could be that I made a mistake during the measurement. I will repeat the measurements (also with different equipment) and keep you posted. Could you maybe check if this is reproducible on your end?

giuliomoro Waking up a thread normally takes between 10us and 20us. A 2 samples block size means that the thread should wake up every 45us. Therefore, the time that it takes to wake it up is a significant fraction of the amount of time it takes to run it, and, accordingly, this is what you see as "CPU usage". Also, as long as the block size is >=4, there is a built-in sleep() to make sure the IDE always remains accessible. This can be disabled passing --high-performance-mode to the running program, and it should show about 1.5% performance increase. The downside is that the IDE may become unresponsive when running a heavy-CPU program, at which point you'd have to press the button on the cape to stop the program.

Is there a way of reducing this context switch time, or at least of making it more deterministic? In practice one would need to consider the worst case of 20us, i.e. 44% of idle load. I'm aware of the high performance mode and IMHO it should be set by default or at least be more visible (a checkbox, for example), as audio clicks are more critical than an unresponsive IDE for real-time audio applications.

giuliomoro That should not happen

I retested it and it actually looks good for the 4x44100 and 8x22050 cases. However, for the 2x88200 case the CPU load is much higher, and when I proceeded to a listening test, I realized audio wasn't working at all, so this might explain it. There was a high-pitched noise playing, as though the audio processing were blocked.

giuliomoro You should have about -86dB THD+noise on the audio channels, but I'd be surprised if it varied with block length. Can you hear any distorted audio?

Is this -86dB measured, or estimated from the CODEC specs? I'll proceed to a more precise measurement and share the results. I can't comment on the listening test just yet, as I was focusing on the electrical measurement. I will also update on that.

giuliomoro · Jan 11, 2019

edwillys However, for the 2x88200 case the CPU load is much higher and when I proceeded to a listening test, I realized audio wasn't working at all, so this might explain

Oh. That's surprising, I will have to test that: it's a fairly uncommon mode of operation to use the mini with 88.2kHz analog inputs.

edwillys Is this -86dB measured or estimated due to the CODEC

Measured. The codec datasheet is more like 98dB in / 102dB out, IIRC.

edwillys · Jan 12, 2019

Hello,

Small update on the influence of the block size on the latency. I measured the delay using only the Bela, by calculating the cross-correlation between output and input (output routed to input with a jack cable, generating white noise at the output), and indeed the delay values are very similar to those reported in the paper http://eecs.qmul.ac.uk/~andrewm/mcpherson_aes2015.pdf . So, there was a problem with my measurement setup.
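For readers who want to try the same idea offline, here is a minimal sketch of a cross-correlation delay estimate. It is not the code used for the measurement above: the loopback is simulated with a pure delay and a gain, and the delay of 57 samples, the gain of 0.8, and all names are assumptions chosen for illustration.

```python
import random


def estimate_delay(sent, received, max_lag):
    """Return the lag (in samples) that maximises the cross-correlation."""
    best_lag, best_score = 0, float("-inf")
    for lag in range(max_lag + 1):
        # Correlate sent[n] with received[n + lag] over the overlapping part.
        overlap = min(len(sent), len(received) - lag)
        score = sum(sent[n] * received[n + lag] for n in range(overlap))
        if score > best_score:
            best_lag, best_score = lag, score
    return best_lag


# Simulated loopback: white noise sent out, then delayed and attenuated.
rng = random.Random(0)
noise = [rng.uniform(-1.0, 1.0) for _ in range(2000)]
true_delay = 57  # assumption for the demo: 2 * 8 + 41 samples
received = [0.0] * true_delay + [0.8 * x for x in noise[:len(noise) - true_delay]]

if __name__ == "__main__":
    lag = estimate_delay(noise, received, max_lag=200)
    print(f"estimated delay: {lag} samples ({1000.0 * lag / 44100:.2f} ms)")
```

White noise works well here because its autocorrelation has a single sharp peak, so the lag of the maximum is unambiguous. A real analog loopback would also add filtering and noise, which this sketch ignores.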

Best.

chrion · Nov 5, 2019

edwillys / Anyone else: Where did "2x128" come from, in your calculation of the latency, resulting in 7ms? Do you simply mean that you measured the latency of 2 sample blocks instead of 1, which is more common? I would like to get this correct in my head, thanks!

giuliomoro · Nov 5, 2019

See if this helps to understand the double buffering process.

[Image: diagram of the double buffering process]

Processing takes place in blocks. At any time, there are:
1. one block of data being read from the input, and not yet accessible to the user
2. one block of data that the user code is processing
3. one block of data that is being written to the output

A given audio frame will be in exactly one of these places at any given time.

For a signal to get from the input to the output, while being processed by the user, it has to go through these stages in the specified order, thus leading to the roundtrip latency being 2x the block size.
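The three stages can be sketched as a small simulation. This is a simplified model, not Bela's actual implementation: it assumes pass-through user code, an input length that is a multiple of the block size, and it ignores the converter's own baseline delay. The function name is hypothetical.

```python
def simulate_passthrough(input_samples, block_size):
    """Model the three-stage block pipeline: capture -> process -> playback."""
    silence = [0.0] * block_size
    processing = silence  # block the user code is currently working on
    playing = silence     # block currently being written to the output
    output = []
    for start in range(0, len(input_samples), block_size):
        # Stage 1: this block is being read from the input during this period.
        capturing = input_samples[start:start + block_size]
        # Stage 3: meanwhile, the previously processed block is played out.
        output.extend(playing)
        # End of the period: every block advances by one stage.
        playing = list(processing)  # stage 2 result (pass-through copy)
        processing = capturing
    return output


if __name__ == "__main__":
    block_size = 8
    impulse = [1.0] + [0.0] * (4 * block_size - 1)
    delayed = simulate_passthrough(impulse, block_size)
    print("impulse in at sample 0, out at sample", delayed.index(1.0))
```

In this model, an impulse fed in at sample 0 with a block size of 8 comes out at sample 16, i.e. 2x the block size, regardless of where in the block the sample falls.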

edwillys · Nov 9, 2019

Couldn't have explained it better.

Ward · Oct 31, 2024

giuliomoro The baseline roundtrip latency is about 41 samples, due to internal filtering of the ADC/DAC (39 samples)

I'm trying to implement a self-test with audio loopback by plugging the output back into the input.

The expected time shift of the signal should be 2 * bufferSize + 39 then?

giuliomoro · Oct 31, 2024

Is this a digital loopback or analog? If analog, you'll need to account for possible phase differences and/or fractional delays and gain inaccuracies that may make it hard to do a sample-by-sample comparison. The nominal values depend on the group delay of the codec in use.

Depending on the level of detail that you need for the test, you may use something coarser. We use this to test the Bela capes: https://github.com/BelaPlatform/Bela/tree/master/examples/Extras/cape-test . It makes a lot of assumptions about the subset of things that can go wrong and tests specifically for those. It has served us well over the past 9 years for factory testing.