ok, this is a bit of a repeat of a previous post, but I wanted to dive into it a bit...

background:
a while back I ported the Mutable Instruments modules to Pd externals and released them for the Organelle. Naturally I'm now using these on Bela, and planning on releasing them for Bela and Salt!

source code (including makefile) is here
https://github.com/TheTechnobear/Mi4Pd

the issue is I'm seeing pretty poor performance on Bela.

let's take one external as an example: rngs~, aka Rings.

at a high level, the Organelle and Bela both use a 1GHz single-core ARM, so they are 'similar'

on an Organelle I can have 5 instances and reach around 80% CPU or less, still with no audio issues, and the Organelle stays responsive.
(pd -audiobuf 4, i.e. 4 ms at 44100 Hz, so ~176 samples?)

on Bela/Salt, 2 instances bring it to its knees - unresponsive etc.
(libpd -p 128; I tried 256, it made little real difference, still nowhere near Organelle levels)

let's also consider that the Rings code is written to run on an STM32F4 at 168MHz, so the Organelle is more in the ballpark of what I'd expect (iirc the STM32F4 only has a basic single-precision FPU, not NEON)...
Bela is performing more like a single F4 at 168MHz, surely that's not right?

ok, there are some differences
Organelle: A9 @ 1GHz, running at 44100 SR, pd -audiobuf 4 = ~176 samples
Bela: A8 @ 1GHz, running at 48000 SR, 128 samples

but I really do not think this accounts for a 3x difference in performance.

also bear in mind the Organelle is also running another 'mother' process and OSC for the display, 4 pots... and it's running a 'vanilla' Arch Linux distro.

I don't think the difference is in Pd either...
when running in libpd or pd, it's the rngs~ external which is eating CPU; the overhead of Pd is going to be negligible, especially when running multiple instances... as Pd is really only glue that's calling the render function on the external.
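
(to illustrate what I mean by 'glue': the external's perform routine is basically just the below - the names here are made up for illustration, the real code is in the repo above)

#include "m_pd.h"

/* sketch with made-up names - see the Mi4Pd repo for the real thing */
typedef struct _rngs_tilde {
    t_object x_obj;
    void    *mi_dsp;                      /* stands in for the MI C++ processor object */
} t_rngs_tilde;

static t_int *rngs_tilde_perform(t_int *w)
{
    t_rngs_tilde *x   = (t_rngs_tilde *)(w[1]);
    t_sample     *in  = (t_sample *)(w[2]);
    t_sample     *out = (t_sample *)(w[3]);
    int           n   = (int)(w[4]);      /* pd block size */

    /* pd itself does almost nothing here; essentially all the cpu time is spent
       inside the MI rendering call, i.e. something like:
           mi_process(x->mi_dsp, in, out, n);  */
    (void)x; (void)in; (void)out; (void)n;

    return (w + 5);
}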

compiler options: ok, you can see these in the CMakefile...
I've got the recommended ones for Bela, but that does not help, and actually it makes little difference compared to using the same ones I use for the Organelle.

audio mode switches - there are no more mode switches after the initial startup

monitoring:
I'm using 'top' on the Organelle, and looking at /proc/xenomai/sched/stat on Bela.
on Bela I can see most of the load (87%) on the Bela audio thread.

so... I'm really wondering what to look at next?

compiler options:
I don't think this is the cause

libpd:
perhaps some oddity with buffer size?
I'm thinking of getting rngs~ to report the buffer size and then seeing what I see on both Organelle and Bela,
though the code copes with different buffer sizes.
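
(something like this in the dsp method should do it - again a sketch, using the made-up names from the perform sketch above)

static void rngs_tilde_dsp(t_rngs_tilde *x, t_signal **sp)
{
    /* report what pd is actually handing us */
    post("rngs~: block size %d, sizeof(t_sample) %d, sr %d",
         sp[0]->s_n, (int)sizeof(t_sample), (int)sp[0]->s_sr);
    dsp_add(rngs_tilde_perform, 4, x, sp[0]->s_vec, sp[1]->s_vec, sp[0]->s_n);
}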

more effort, but I could compile Rings as a native Bela render.cpp.
I'd only be able to run one instance, but I could see if the load for one was significantly different to running one instance inside libpd (it shouldn't be)... but it's a few hours of work that I'd probably 'bin' later.
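
(the skeleton would be roughly the below - the MI-side calls are placeholders here, not the real rings API)

#include <Bela.h>
// #include "rings/dsp/part.h"           // the MI headers, paths as per Mi4Pd

// static rings::Part gPart;             // placeholder: the real object needs Init() + buffers

bool setup(BelaContext *context, void *userData)
{
    // gPart.Init(...);                   // set up the MI DSP object here
    return true;
}

void render(BelaContext *context, void *userData)
{
    for(unsigned int n = 0; n < context->audioFrames; n++) {
        float in  = audioRead(context, n, 0);
        float out = in;                   // pass-through until the MI processing is wired in
        audioWrite(context, n, 0, out);
        audioWrite(context, n, 1, out);
    }
}

void cleanup(BelaContext *context, void *userData)
{
}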

also, generally, I'd say this thread is not about this one case...
it feels like libpd performance on a number of my patches is actually quite poor, so perhaps there is something amiss in the libpd render that can be fine-tuned... or something in the compile options.
I'm really not sure 🙂

finally, it's quite possible I'm missing something,
perhaps in how I'm measuring things, in my expectations, or something else?

so really I'm here just to get some ideas of what to look at / try...
it's not a criticism... really just seeing if we can improve things.

thoughts on what to try?

footnotes

testing...
with the above source you can build the externals and then just throw a Pd patch on; if you'd like me to put together a test patch I can do that too.

last tests were on the Salt from the workshop, which is running 3.3?
(ps. for this build, you should remove your global github credentials 😉)

Have you tried a standalone benchmark test with some floating point stuff to see if there is a large difference in performance there?

Also is t_float defined correctly on the bela, it's not maybe a double?
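
(by a standalone benchmark I just mean something quick and dirty along these lines, built with the same flags on both boxes plus -lm - nothing scientific, just raw float throughput:)

#include <stdio.h>
#include <math.h>
#include <time.h>

int main(void)
{
    volatile float acc = 0.0f;            /* volatile so the loop can't be optimised away */
    clock_t start = clock();
    for (int i = 0; i < 20000000; i++) {
        float x = (float)i * 0.0001f;
        acc += sinf(x) * sqrtf(x + 1.0f); /* some typical float maths */
    }
    printf("acc=%f, time=%.3f s\n", acc,
           (double)(clock() - start) / CLOCKS_PER_SEC);
    return 0;
}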

    AndyCap Have you tried a standalone benchmark test with some floating point stuff to see if there is a large difference in performance there?

    No, but really I think we are an 'order of magnitude' out here...
    bear in mind this MI code is written for an STM32F4, and I've got it working fine on an Axoloti; these are 168MHz chips with only a limited (single-precision) FPU.
    and it looks like I'm getting similar performance on Bela, which is a 1GHz ARM A8 - it really should be many times more powerful..
    even taking into account 'pd inefficiencies', something seems amiss. no?

    for completeness, the Axoloti is an STM32F429 at 168MHz, running at 48000 SR with a 16-sample buffer!
    (I've not done a side-by-side test yet, as that might just depress me 😉)

    Also is t_float defined correctly on the bela, it's not maybe a double?

    I'll check, though I can't see why it would be different from the Organelle (which uses a plain pd install too).

    also, this really won't make much of a difference, as the MI code uses floats (it's all C++, not Pd), so even then only the external layer would have to convert from double to float; the MI calcs would all still be 32-bit... and I'd assume that is where the 'load' is.
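
    (worst case the wrapper would just need a per-block copy, something like this hypothetical helper - cheap next to the DSP itself:)

    /* hypothetical - only relevant if t_sample really were double */
    static void to_float32(const double *in, float *out, int n)
    {
        for (int i = 0; i < n; i++)
            out[i] = (float)in[i];
    }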

    I do think it's a quirk somewhere, just struggling to think exactly where?!

      thetechnobear also, this really won't make much of a difference, as the MI code uses floats (it's all C++, not Pd), so even then only the external layer would have to convert from double to float; the MI calcs would all still be 32-bit... and I'd assume that is where the 'load' is.

      Ah, I didn't look at the code closely enough, just at all the t_floats at the top.

      ok, it's all 32-bit: sizeof(t_float) and sizeof(t_sample) both return 4 on both Organelle and Bela.

      one difference is the incoming block size: it's 16 on Bela, and 64 on Organelle.
      (it's also 16 on Axoloti, but let's ignore that for now 😉)

      note: the buffer size (-p) is 128, so my assumption is it just chops this up into blocks of 16 samples.

      @giuliomoro is there a way, using a custom renderer, to change this to 64?
      ... I can't see anything in the default_libpd_render.cpp, nor in z_libpd.h
      I'm a bit confused: the default is supposed to be 64, but this is 16 - is this dependent on how you compile libpd?

      I'm not sure it's going to make any difference, as it's still the same work, and the STM is OK with 16, but it would be nice to have it the same as the Organelle, just to start eliminating differences.

      anything else i can look at/try?

      I could start doing more detailed comparisons between Bela/Organelle/STM32F4 (Axoloti), but I'm not sure that gets me much further?

        Well I just did a simple floating point C benchmark test:

        Bela 8.3 seconds
        Organelle 3.14 seconds

        So, something strange is going on...

          AndyCap Well I just did a simple floating point C benchmark test:

          Bela 8.3 seconds
          Organelle 3.14 seconds

          So, something strange is going on...

          ouch, that is concerning... but in line with what I'm seeing... about 1/3 of the performance.
          I assume you used the recommended Bela compile options? (or did you just do it in the browser?)

          what gets me is I know I can run one instance of this code on Axoloti, so I'd expect Bela to be much better, given the Axoloti is also a low-latency RTOS with a 16-sample buffer... surely there must be something wrong here.

          I've just done another Pd test...

          a 'fairly' basic 4-note poly synth patch (polysynth, if you're familiar with it), I think it's pure Pd, no externals.

          Organelle - I can get 3 instances (so 12 voices) running, it hits about 90% CPU, but the box is still running perfectly happily - and there are no audio dropouts.

          Bela - 1 instance is 67%; a 2nd instance brings it to its knees, I can't even type in an ssh console, and eventually I even lose the network connection.

          @giuliomoro any suggestions? thoughts? something i can test? try?

          also, how do I read /proc/xenomai/sched/stat correctly?
          if I see bela_audio at 67%, is this 67% of the total CPU, or 67% of some fraction of the total CPU (i.e. with some still being taken by Linux)? if so, how can I tell how much it's really using in total?

            thetechnobear note: the buffer size (-p) is 128, so my assumption is it just chops this up into blocks of 16 samples.

            yes

            thetechnobear is there a way, using a custom renderer, to change this to 64?

            no, you need to recompile libpd for that (there must be a more clever way but I cannot figure it out just now), see here https://forum.bela.io/d/101-compiling-puredata-externals/47..
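
            (for reference, the wrapper just processes the whole -p buffer in however many 'ticks' libpd reports, roughly libpd_process_float(context->audioFrames / libpd_blocksize(), in, out), and that block size is baked into libpd when it is compiled. If I remember right it comes from DEFDACBLKSIZE in the pd sources, along these lines - treat the exact mechanism as from memory:)

            /* upstream pure-data default (s_stuff.h), from memory: */
            #ifndef DEFDACBLKSIZE
            #define DEFDACBLKSIZE 64
            #endif
            /* the Bela libpd build presumably defines this as 16 at compile time,
               which is why your external sees 16-sample blocks */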

            Everything else, I will have a look later. Show time!

            thetechnobear ouch, that is concerning... but in line with what I'm seeing... about 1/3 of the performance.
            I assume you used the recommended Bela compile options? (or did you just do it in the browser?)

            I have tried various compile options, what are the recommended Bela ones?

            -O3 -march=armv7-a -mtune=cortex-a8 -mfloat-abi=hard -mfpu=neon -ftree-vectorize -ffast-math

            I am not so surprised this runs slower than other CPUs: the VFPlite on the A8 should take ~10 cycles per instruction, as opposed to the full-fledged VFPs. So most likely many of the instructions generated from the compiler are VFP as opposed to NEON.

              Thanks, I had the same flags.

              Integer maths seems around the same speed; the float maths is pretty slow compared to the A9 though.

              giuliomoro So most likely many of the instructions generated from the compiler are VFP as opposed to NEON.

              in my externals I've told it -mfpu=neon, so it should be using NEON.
              (also, a plain Pd patch has the same issues, so that would be down to how libpd is compiled, no?)

              I do understand the A9 might have better performance, but the thing I'm struggling to understand is:
              the MI code runs fine on the Axoloti - a Cortex-M4 with only a basic FPU, running at 168MHz (and the same goes for the actual MI modules)...
              so given rings/elements etc. run OK on that, surely Bela/BBB/A8 should be able to run 2 or 3 instances with 'relative' ease.... but it seems to be able to run only 1, which means Linux/libpd is using the rest of the 800MHz 😉

              giuliomoro So most likely many of the instructions generated from the compiler are VFP as opposed to NEON.

              hmm, so I guess I could get it to produce an assembly listing during compilation, and then search for VFP instructions? any particular ones?
              (I can definitely see the -mfpu=neon flag when I compile, but perhaps it somehow gets 'ignored'?)
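
              (my rough understanding - needs double-checking - is that scalar VFP code does its float maths on the s registers, e.g. vmul.f32 s0, s1, s2, whereas NEON works on the d/q registers, e.g. vmul.f32 q0, q1, q2 ... so if the listing is full of .f32 ops on s registers it probably hasn't been vectorised)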

              have fun at your event tonight 🙂

                I did another test, very simple - just additions and divides - and with that test the Bela is running at 50% of the speed of the Organelle, so an improvement over my previous test, which used a lot of the maths functions.

                oh b***, it's gone from bad to worse...

                I had the 'offending' patch set to run at startup, and now when I try to connect to the Salt via USB (ssh or web browser) it doesn't connect (it shows an IP, and pings, but won't connect) - lack of CPU?

                so I thought ok, pop it out of the rack, pop out the SD card, mount it on another Linux box, fix it, back in business.

                nope, the image is on the eMMC... so now I have to grab an SD card, get an image, mount the eMMC and fix it...
                (well, I think that's what I have to do)
                moral of the story: don't do hacking on the Salt 😉

                  ok, back in the land of the living 🙂
                  what you need to do if you need to 'fix' it is:

                  pop it out of the rack,
                  put in an SD card with a BeagleBone image (I used one from my other Belas),
                  boot it up,
                  connect,

                  mkdir tmp
                  # mount the emmc so you can edit files on it
                  mount /dev/mmcblk1p2 tmp
                  cd tmp
                  

                  do whatever fixes you need

                  cd ..        # can't umount while still inside the mount point
                  umount tmp
                  systemctl poweroff
                  

                  pop out sdcard
                  put back in the rack

                  on the first boot I found it took ages (3 mins!) for the network to come up, but after that it now seems fine.
                  (I'm guessing fsck was doing checks or something, due to the number of unclean shutdowns?)

                  Just been looking at this again using some simple code that scales a float array by multiplication, and I have noticed very slow times for clang vs gcc:

                  Organelle gcc = 4.273 seconds
                  Bela gcc = 9.985 seconds
                  bela clang = 28.838 seconds

                  I'm going to change it to use NEON intrinsics to see the difference...
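
                  Roughly this sort of loop - simplified, not the exact benchmark code:

                  #include <arm_neon.h>

                  /* scale a float array by a constant, 4 samples at a time
                     (assumes n is a multiple of 4) */
                  void scale(float *buf, float gain, int n)
                  {
                      for (int i = 0; i < n; i += 4) {
                          float32x4_t v = vld1q_f32(buf + i);   /* load 4 floats */
                          v = vmulq_n_f32(v, gain);             /* multiply all 4 by gain */
                          vst1q_f32(buf + i, v);                /* store back */
                      }
                  }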

                  And using NEON intrinsics:

                  Organelle gcc = 3.749 seconds
                  Bela gcc = 9.369 seconds
                  bela clang = 9.670 seconds

                  From what I have read on the web, NEON on the A8 should be around the same speed as on the A9, so I am at a loss here.

                    Ok, it seems to be a memory alignment issue on the RAM -> NEON register transfers. I aligned my float array to 64 bytes and now get this:

                    Using NEON intrinsics, 64-byte aligned:

                    Organelle = 4.755 seconds (slower!)
                    Bela gcc = 3.039 seconds
                    Bela clang = 5.662 seconds

                    Using standard C code, 64-byte aligned:

                    Organelle = 4.066 seconds
                    Bela gcc = 4.874 seconds
                    Bela clang = 3.439 seconds
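
                    For the alignment I'm just doing something along these lines (sizes here are arbitrary):

                    #include <stdlib.h>

                    #define NUM_FLOATS 65536

                    /* option 1: a static buffer aligned to 64 bytes */
                    __attribute__((aligned(64))) static float buf[NUM_FLOATS];

                    /* option 2: an aligned heap allocation */
                    static float *make_buffer(void)
                    {
                        void *p = NULL;
                        if (posix_memalign(&p, 64, NUM_FLOATS * sizeof(float)) != 0)
                            return NULL;
                        return (float *)p;
                    }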

                      AndyCap
                      do you want to try adding

                      -funsafe-math-optimizations
                      

                      If the selected floating-point hardware includes the NEON extension (e.g. -mfpu=neon), note that floating-point operations are not generated by GCC’s auto-vectorization pass unless -funsafe-math-optimizations is also specified. This is because NEON hardware does not fully implement the IEEE 754 standard for floating-point arithmetic (in particular denormal values are treated as zero), so the use of NEON instructions may lead to a loss of precision.
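
                      for reference that would make the full line something like:

                      -O3 -march=armv7-a -mtune=cortex-a8 -mfloat-abi=hard -mfpu=neon -ftree-vectorize -ffast-math -funsafe-math-optimizations

                      (though reading the gcc docs, -ffast-math may already imply -funsafe-math-optimizations, so it's possible this makes no difference - still worth a try, and worth checking the generated assembly either way)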

                      I've been meaning to try, but I got sidetracked and now have guests arriving 🙂
                      otherwise I will try this evening.