heavy on newest bela release

giuliomoro

looks like that option is not supported on the clang / gcc on the board. It was probably introduced more recently.

AndyCap

Ah, what version of clang is in the board?

giuliomoro

It's 3.9 on the board. clang 6.0 seems to have it, and you can install it with apt-get from stretch-backports (though I haven't tried: I just downloaded it from llvm.org).

Haven't managed to make it produce the desired code yet, though.

giuliomoro

For whatever reason, on clang-3.9, alignof(float32x4_t) is 16 bytes, while on clang-6 (and on gcc-6.3), it is 8 bytes. The clang-3.9 vs clang-6 difference is particularly weird, because their respective arm_neon.h includes are very similar.

AndyCap

I'm a bit confused!

I thought the original issue was that 3.9 wasn't aligning to 16?

giuliomoro

AndyCap I thought the original issue was that 3.9 wasn't aligning to 16?

The original issue is that on clang-3.9, alignof(float32x4_t) is 16, and it (probably) generates assembly code that assumes the memory to be aligned. If the memory is not actually aligned to a 16-byte boundary, you get a SIGBUS.

However, when allocating with operator new an object of a class that contains one or more float32x4_t, there is no guarantee that these will actually be allocated with the appropriate alignment. The underlying issue is that, before C++17, operator new is only guaranteed to return memory suitably aligned for types with fundamental alignment, and support for over-aligned types is implementation-defined. Hence the workaround above with a custom allocator. It seems that C++17 offers better support for aligned heap allocation, but I didn't get into the details of that.
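To illustrate the idea of a class-specific aligned allocator, here is a minimal sketch. It is not the code from the workaround referred to above: the names Vec4, AlignedObject and isAligned16 are made up for this example, Vec4 is only a stand-in for a NEON vector type, and it assumes a POSIX system where posix_memalign is available.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstdlib>
#include <new>

// Hypothetical stand-in for a NEON vector such as float32x4_t.
// alignas(16) mimics the 16-byte requirement reported by clang-3.9.
struct alignas(16) Vec4 {
    float lanes[4];
};

// Hypothetical class that contains an over-aligned member.
class AlignedObject {
public:
    // Class-specific operator new: request storage aligned for this class.
    static void* operator new(std::size_t size) {
        void* ptr = nullptr;
        // posix_memalign returns 0 on success. The alignment must be a
        // power of two and a multiple of sizeof(void*); 16 satisfies both.
        if (posix_memalign(&ptr, alignof(AlignedObject), size) != 0)
            throw std::bad_alloc();
        return ptr;
    }

    // Memory obtained from posix_memalign is released with free().
    static void operator delete(void* ptr) noexcept { std::free(ptr); }

    Vec4 state{};
};

// Returns true if the pointer sits on a 16-byte boundary.
inline bool isAligned16(const void* p) {
    return reinterpret_cast<std::uintptr_t>(p) % 16 == 0;
}
```

With this in place, `new AlignedObject` goes through the overload above, so the object and its `state` member land on a 16-byte boundary. Note that the sketch only covers single objects: `new AlignedObject[n]` would need matching `operator new[]` and `operator delete[]` overloads.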

What is surprising is that while alignof(float32x4_t) is 16 bytes on clang-3.9, it is actually 8 bytes on gcc-6.3 and clang-6.0. I guess that this could explain both:
- the performance penalty for gcc-6.3 vs clang-3.9 when using intrinsics
- the lack of SIGBUS on gcc-6.3: the generated assembly does not make assumptions about alignment that are then violated at runtime.

Seeing how alignof(float32x4_t) has become 8 bytes in clang-6.0, I am afraid this may also come with a performance penalty (but I haven't verified it yet). However, I am wondering where that difference comes from, as the arm_neon.h files for the two versions of clang provide identical definitions for all the neon vector types.

Incidentally, given how the change in alignof can also change the size of structs and classes, this suggests that header files for libraries should abstract away such classes and structs, so that the library will work fine even if the compiler used to compile it is different from the one used to build the user code that uses it.
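To make the layout point concrete, here is a small sketch of how a struct's size and member offsets shift when the alignment of one member changes. All type names are hypothetical: Vec8 and Vec16 only imitate a 4-float vector type as two different compilers might see it, and the numbers in the static_asserts assume a 4-byte float.

```cpp
#include <cstddef>

// Hypothetical stand-ins for a 4-float vector type as seen by two compilers:
// one gives it 8-byte alignment, the other 16-byte alignment.
struct Vec8  { alignas(8)  float lanes[4]; };
struct Vec16 { alignas(16) float lanes[4]; };

// The same user-facing struct, laid out under each assumption.
struct VoiceAlign8  { char flag; Vec8  state; };
struct VoiceAlign16 { char flag; Vec16 state; };

// Where each compiler believes `state` lives inside the struct.
constexpr std::size_t kOffset8  = offsetof(VoiceAlign8, state);
constexpr std::size_t kOffset16 = offsetof(VoiceAlign16, state);

// Assuming a 4-byte float, the padding after `flag` grows with the
// alignment, so both the member offset and the total size differ.
static_assert(kOffset8 == 8 && kOffset16 == 16, "member offsets differ");
static_assert(sizeof(VoiceAlign8) == 24 && sizeof(VoiceAlign16) == 32,
              "struct sizes differ");
```

If a library were built with one layout and the user code with the other, the two sides would disagree on where `state` is, which is why hiding such types behind an opaque pointer in the public header can help.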

lokki

this is for android (arm) but seems to talk about the clang problem:

https://github.com/axboe/fio/issues/356

giuliomoro

Linker problem

One issue of 3.9 vs 6.0 is that linking is always done on the board by calling clang++-3.9, and so if there is any incompatibility between the .o files generated by 6.0 and the way clang++-3.9 calls the ld linker, or the way the older ld linker on the board deals with object files, then you could have some issues. I guess this is unlikely to happen, but I am not sure what guarantees clang and ld give about interoperability of different versions.

Compiler problem

Another issue is that different versions of clang may decide to pack structs (or at least ARM intrinsics) in different ways, for instance:

giuliomoro What is surprising is that while alignof(float32x4_t) is 16 bytes on clang-3.9, it is actually 8 bytes on gcc-6.3 and clang-6.0.

and

giuliomoro However, I am wondering where that difference comes from, as the arm_neon.h files for the two versions of clang provide identical definitions for all the neon vector types.

More generally, there would be a problem if you have object files that both include some .h file that is interpreted differently by the two versions of the compiler, and you compile the two objects with different versions of the compiler. Again, I am not sure what backward/forward compatibility guarantees clang provides. Clearly, if you ALWAYS use distcc to build all the files on Bela, there would be no problem. The problem only comes up (potentially) if you are intermixing .o files compiled with different versions of the compiler.
For instance, in your current situation, you built all the Bela core/ files with the local version of clang-3.9 the first time you built the first program. If you switch to using clang-6.0, I'd recommend you do a make -C ~/Bela clean on the board, so that those files are re-built next time you build a project.

lokki

giuliomoro

so where exactly do i put this in the file? in the public part of the class? sorry if this is obvious....

giuliomoro

yes

lokki

giuliomoro thanks, this works!

giuliomoro

Did some tests for CPU usage with the three compilers. On a v0.3.6b image:

Techno-world:

heavy/gcc-6.3     25%
heavy/clang-3.9   21%
heavy/clang-6.0   24%

Your reverbtest patch:

heavy/gcc-6.3     16%
heavy/clang-3.9   18%
heavy/clang-6.0   16%

noting that, in both cases, clang-3.9 needs the fix above in order to run.

The fact that clang-6.0 is slower on techno-world is pretty annoying, though.

lokki

giuliomoro interesting...
so that is the value the IDE shows, or how do you measure? might need to check on my actual granular patch...

giuliomoro

It's actually the value you see by doing

watch -n 0.5 cat /proc/xenomai/sched/stat

which is what the IDE grabs and displays. Running the above in the terminal is more lightweight than through the IDE, so you can run it even if the CPU usage is so high that the IDE is not responsive.

lokki

ok, so i notice something funny:

when the IDE is not running my patch is running at 69 to 70 %, when i run the IDE that value fluctuates from 58 to 70 %

giuliomoro

The CPU usage indicator in the IDE takes into account all the threads of the program, and both their Xenomai CPU usage (as shown in /proc/xenomai/sched/stat) and their Linux CPU usage (as shown by the top command). So for instance, if you have the scope window open, and you are using the Scope, then the CPU usage will differ between the IDE and the terminal, but in the IDE it should be higher. So I don't think that explains your case.

Is there anything in the patch that would change the CPU usage over time? Wanna send it over?

lokki

giuliomoro Is there anything in the patch that would change the CPU usage over time? Wanna send it over?

i don't think so. also when the IDE is not running the CPU usage stays the same all the time. note that i used the "watch" method in both cases (with and without IDE) to measure usage.

sure, i'll send the patch.

giuliomoro

I have included the aligned new allocator in the heavy template file on my repo : https://github.com/giuliomoro/hvcc/commit/35a5bda6063025e2fc99904e1c47702152924c08 so now the fix will be automatically applied to newly built projects. Also fixed behaviour (on the Bela side) when the expected MIDI interface is not connected (https://github.com/BelaPlatform/Bela/commit/19349c4760b03e19fb3dfd343b604cebd1c506f9 ).

As for the performance of your patch, it runs at about 26% for me, not sure if it's because I don't have any buttons/MIDI hooked up.

lokki

giuliomoro

what? 26% with clang? which version? how could not connected buttons have an influence on cpu load? i will check out your repo and see. since i have a spare bela i can also try with a bareboard

giuliomoro

that's clang-3.9, stock compiler on the board. Maybe you still have some files that were built with -g -O0 above? -g adds debugging info to the file, but -O0 disables all optimizations. Maybe try cleaning the project (delete the build/ folder in there, or equivalently run make -C ~/Bela PROJECT=projectName clean) and rebuild?

lokki ? how could not connected buttons have an influence on cpu load?

I don't know if there is something you do with buttons and MIDI that changes the CPU processing? Your patch is too complicated for me to understand it at a glance!

lokki

giuliomoro I don't know if there is something you do with buttons and MIDI that changes the CPU processing? Your patch is too complicated for me to understand it at a glance!

sorry, of course it is a mess actually 🙂

i only start writing to tables and change which tables are played back with the buttons. midi is only for note input to change playback speed and aftertouch to scroll through the table when in "scrub" mode. i am now recompiling with your new hvcc repo, and i will disconnect all midi controllers to see if it makes a difference in load. but my CPU was a steady 69 to 70 % regardless of any midi activity or button presses. that is why i was so astonished, the simple fact that a midi device and buttons are connected should not make that much of a difference.

EDIT: ahem stupid me i guess, should i be looking at the bela-audio load? that is as you suggested 26-28% i was looking at the ROOT process which takes 70%..

Every 0.5s: cat /proc/xenomai/sched/stat                               bela: Wed May  8 19:10:56 2019

CPU  PID    MSW        CSW        XSC        PF    STAT       %CPU  NAME
  0  0      0          4353657    0          0     00018000   70.0  [ROOT]
  0  1681   9          11         26         0     000600c0    0.0  granular
  0  1691   2          3          4          0     000480c0    0.0  bela-midiIn_hw:0,0,0
  0  1692   2          3          4          0     000480c0    0.0  bela-midiOut_hw:0,0,0
  0  1693   5          761787     761794     0     00048046   28.1  bela-audio
  0  0      0          2427976    0          0     00000000    0.7  [IRQ16: [timer]]
  0  0      0          380980     0          0     00000000    0.7  [IRQ180: rtdm_pruss_irq_irq]

does that look reasonable? does ROOT just take all that is left to get to 100% then? (i seem to remember some discussion in the forum between you and thetechnobear where this came up as well)

giuliomoro

lokki EDIT: ahem stupid me i guess, should i be looking at the bela-audio load? that is as you suggested 26-28% i was looking at the ROOT process which takes 70%..

Yes!

lokki ? does ROOT just take all that is left to get to 100% then?

yes. That is the time dedicated to Linux (including threads from your Bela program that run in secondary mode). For reference, if you were using 70% of the CPU for audio, the IDE would basically stop working, and commands on the terminal would be fairly sluggish.
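To avoid reading the wrong row by eye, the %CPU value for a given thread can be pulled out of that text programmatically. The following is only a sketch: the function name cpuForThread is made up for this example, and it assumes the column layout shown in the output above (CPU PID MSW CSW XSC PF STAT %CPU NAME), which may differ on other Xenomai versions.

```cpp
#include <cctype>
#include <sstream>
#include <string>

// Returns the %CPU column of the first row whose NAME equals threadName,
// or -1.0 if no such row is found. Assumes the column layout
// CPU PID MSW CSW XSC PF STAT %CPU NAME.
double cpuForThread(const std::string& statText, const std::string& threadName) {
    std::istringstream lines(statText);
    std::string line;
    while (std::getline(lines, line)) {
        std::istringstream fields(line);
        std::string skipped;
        bool ok = true;
        // Skip the seven columns that precede %CPU.
        for (int i = 0; i < 7 && ok; ++i)
            ok = static_cast<bool>(fields >> skipped);
        double percent = 0.0;
        if (!ok || !(fields >> percent))
            continue; // header line or malformed row

        // NAME is the rest of the line and may contain spaces.
        std::string name;
        std::getline(fields >> std::ws, name);
        while (!name.empty() &&
               std::isspace(static_cast<unsigned char>(name.back())))
            name.pop_back();

        if (name == threadName)
            return percent;
    }
    return -1.0;
}
```

On the board, the contents of /proc/xenomai/sched/stat could be read into a std::string (for instance with std::ifstream) and passed in with "bela-audio" as the thread name to get the audio thread's load, or with "[ROOT]" to get the time left to Linux.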
