While I was looking into the alignment performance issue I noticed there is a Clang command-line flag to set the default alignment of new; might be worth looking at that?
What I found is an ancient discussion among the gcc devs: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=15795 . They mention that the C++ standard makes no guarantee that the pointer returned by operator new is aligned beyond what fundamental types require. As such, the behaviour observed with clang is the expected behaviour; however, this makes all C++ code that uses the default operator new for classes with intrinsics unusable on 32-bit Linux (as on the board), where the underlying call to malloc() returns values aligned to 8 bytes. On Mac (and I shall assume iOS), malloc() always returns a pointer aligned to 16 bytes, making this not an issue for many intrinsics.
So yes, please, if you found a suitable flag for clang, tell us!
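For anyone who wants to see what their own platform does, here is a minimal sketch that checks where malloc() blocks land. The helper names (isAligned, countAligned16, printAllocationAlignment) are made up for illustration, and the result depends on the C library: I would expect 16-byte boundaries on a 64-bit desktop and a mix of 8- and 16-byte boundaries on a 32-bit board, but I have not verified the latter.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <cstdlib>
#include <vector>

// True if p sits on a multiple of `alignment` (which must be a power of two).
bool isAligned(const void* p, std::size_t alignment) {
    assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
    return (reinterpret_cast<std::uintptr_t>(p) & (alignment - 1)) == 0;
}

// Hypothetical helper: allocate `attempts` small blocks with malloc() and
// count how many of them land on a 16-byte boundary.
int countAligned16(int attempts) {
    std::vector<void*> blocks;
    int aligned = 0;
    for (int i = 0; i < attempts; ++i) {
        void* p = std::malloc(24); // blocks are kept alive so the addresses differ
        if (p == nullptr) break;
        if (isAligned(p, 16)) ++aligned;
        blocks.push_back(p);
    }
    for (void* p : blocks) std::free(p);
    return aligned;
}

// Prints the largest fundamental alignment and the malloc() result.
void printAllocationAlignment() {
    std::printf("alignof(std::max_align_t) = %zu\n", alignof(std::max_align_t));
    std::printf("malloc blocks on a 16-byte boundary: %d of 8\n", countAligned16(8));
}
```

Calling printAllocationAlignment() once at startup is enough to tell whether the default allocator can be trusted with 16-byte vector members.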
Try:
-faligned-new=16
Looks like that option is not supported by the clang / gcc on the board. It was probably introduced more recently.
Ah, what version of clang is on the board?
It's 3.9 on the board. clang 6.0 seems to have it, and you can install it with apt-get from stretch backports (though I haven't tried: I just downloaded it from llvm.org).
Haven't managed to make it produce the desired code yet, though.
For whatever reason, on clang-3.9, alignof(float32x4_t) is 16 bytes, while on clang-6 (and on gcc-6.3), it is 8 bytes. The clang-3.9 vs clang-6 difference is particularly weird, because their respective arm_neon.h includes are very similar.
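In case it helps to compare compilers, this is roughly how the numbers can be checked: a minimal sketch that reports alignof and sizeof for the vector type. The Vec4 alias and the function names are my own, and on a machine without NEON it falls back to a stand-in struct, so only the ARM build tells you anything about float32x4_t itself.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdio>

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
typedef float32x4_t Vec4;
#else
// Stand-in for builds without NEON (assumption: four floats, 16-byte aligned).
struct alignas(16) Vec4 { float v[4]; };
#endif

// Alignment the current compiler assumes for the vector type.
std::size_t vec4Alignment() { return alignof(Vec4); }

// Size of the vector type; four 32-bit floats should always give 16.
std::size_t vec4Size() { return sizeof(Vec4); }

// Sanity check plus a printout to compare across compilers.
void printVec4Layout() {
    assert(vec4Size() == 16);
    std::printf("alignof(Vec4) = %zu, sizeof(Vec4) = %zu\n",
                vec4Alignment(), vec4Size());
}
```

Building the same file with each compiler and calling printVec4Layout() should show the 16 vs 8 difference directly, if it is there.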
I'm a bit confused!
I thought the original issue was that 3.9 wasn't aligning to 16?
this is for android (arm) but seems to talk about the clang problem:
AndyCap I thought the original issue was that 3.9 wasn't aligning to 16?
The original issue is that on clang-3.9, alignof(float32x4_t) is 16, and it (probably) generates assembly code that assumes the memory to be aligned. If the memory is not actually aligned to a 16-byte boundary, you get a SIGBUS.
However, when allocating with operator new an object of a class that contains one or more float32x4_t, there is no guarantee that these will actually be allocated with the appropriate alignment. The underlying issue is that the C++ standard (before C++17) does not guarantee that a pointer returned by operator new is aligned beyond what fundamental types require. Hence the workaround above with a custom allocator. It seems that C++17 offers better support for aligned heap allocation, but I didn't get into the details of that.
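To make the idea of a class-level allocator concrete, here is a minimal sketch, not necessarily the exact workaround discussed in this thread. It assumes a POSIX system (posix_memalign), and both the Filter class and the Float4 stand-in for float32x4_t are hypothetical names. The operator new / operator delete overloads go in the public part of the class.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <new>
#include <stdlib.h> // posix_memalign, free (POSIX)

// Stand-in for float32x4_t so the sketch also builds without NEON.
struct alignas(16) Float4 { float v[4]; };

// Hypothetical class holding a vector member that needs 16-byte alignment.
class Filter {
public:
    // Class-specific allocation functions: used by `new Filter` and `new Filter[n]`.
    static void* operator new(std::size_t size) { return allocate(size); }
    static void* operator new[](std::size_t size) { return allocate(size); }
    static void operator delete(void* p) noexcept { free(p); }
    static void operator delete[](void* p) noexcept { free(p); }

    Float4 state{};
    float gain = 1.0f;

private:
    // Ask the C library for storage aligned to what the class requires.
    static void* allocate(std::size_t size) {
        void* p = nullptr;
        if (posix_memalign(&p, alignof(Filter), size) != 0) throw std::bad_alloc();
        assert(reinterpret_cast<std::uintptr_t>(p) % alignof(Filter) == 0);
        return p;
    }
};
```

With this in place, `Filter* f = new Filter;` followed by `delete f;` goes through the aligned path. Objects that are members of another heap-allocated class still depend on how that outer class is allocated.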
What is surprising is that while alignof(float32x4_t) is 16 bytes on clang-3.9, it is actually 8 bytes on gcc-6.3 and clang-6.0. I guess that this could explain both:
- the performance penalty for gcc-6.3 vs clang-3.9 when using intrinsics
- the lack of SIGBUS on gcc-6.3: the generated assembly does not make assumptions about alignment that could then be violated at runtime.
Seeing how alignof(float32x4_t) has become 8 bytes in clang-6.0, I am afraid this may also come with a performance penalty (but I haven't verified it yet). However, I am wondering where that difference comes from, as the arm_neon.h files for the two versions of clang provide identical definitions for all the neon vector types.
Incidentally, given how a change in alignof can also change the size of structs and classes, this suggests that header files for libraries should abstract away such classes and structs, so that the library will work fine even if the compiler used to compile it is different from the one used to build the user code that uses it.
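A small sketch of why the layout changes, using two stand-in vector types that differ only in alignment (all names here are hypothetical, and it assumes a 4-byte float). The same struct definition ends up with a different size and member offset depending on which alignment the compiler picks.

```cpp
#include <cassert>
#include <cstddef>

// Two stand-ins for the same 16-byte vector, as seen by two different compilers.
struct alignas(8)  Vec4Align8  { float v[4]; };
struct alignas(16) Vec4Align16 { float v[4]; };

// Hypothetical library struct whose layout depends on the vector's alignment.
template <typename Vec>
struct Voice {
    float gain;
    Vec state;
};

typedef Voice<Vec4Align8> Voice8;
typedef Voice<Vec4Align16> Voice16;

// With 8-byte alignment: gain at 0, padding to 8, state at 8, total 24 bytes.
std::size_t sizeWithAlign8() { return sizeof(Voice8); }
std::size_t offsetWithAlign8() { return offsetof(Voice8, state); }

// With 16-byte alignment: gain at 0, padding to 16, state at 16, total 32 bytes.
std::size_t sizeWithAlign16() { return sizeof(Voice16); }
std::size_t offsetWithAlign16() { return offsetof(Voice16, state); }

// Self-check of the two layouts described above.
void checkLayouts() {
    assert(sizeWithAlign8() == 24 && offsetWithAlign8() == 8);
    assert(sizeWithAlign16() == 32 && offsetWithAlign16() == 16);
}
```

If a library built with one compiler and user code built with the other both see Voice in a header, they would disagree on where state lives, which is the reason for hiding such types behind an opaque pointer.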
so where exactly do i put this in the file? in the public part of the class? sorry if this is obvious....
yes
giuliomoro thanks, this works!
Did some tests for CPU usage with the three compilers. On a v0.3.6b image:
Techno-world:
heavy/gcc-6.3 25%
heavy/clang-3.9 21%
heavy/clang-6.0 24%
Your reverbtest patch:
heavy/gcc-6.3 16%
heavy/clang-3.9 18%
heavy/clang-6.0 16%
noting that, in both cases, clang-3.9 needs the fix above in order to run.
The fact that clang-6.0 is slower on techno-world is pretty annoying, though.
giuliomoro interesting...
so that is the value the IDE shows, or how do you measure? might need to check on my actual granular patch...
It's actually the value you see by doing
watch -n 0.5 cat /proc/xenomai/sched/stat
which is what the IDE grabs and displays. Running the above in the terminal is more lightweight than through the IDE, so you can run it even if the CPU usage is so high that the IDE is not responsive.
ok, so i notice something funny:
when the IDE is not running my patch is running at 69 to 70 %, when i run the IDE that value fluctuates from 58 to 70 %
The CPU usage indicator in the IDE takes into account all the threads of the program, and both their Xenomai CPU usage (as shown in /proc/xenomai/sched/stat) and their Linux CPU usage (as shown by the top command). So for instance, if you have the scope window open, and you are using the Scope, then the CPU usage will differ between the IDE and the terminal, but in the IDE it should be higher. So I don't think this explains your case.
Is there anything in the patch that would change the CPU usage over time? Wanna send it over?
giuliomoro Is there anything in the patch that would change the CPU usage over time? Wanna send it over?
i don't think so. also when the IDE is not running the CPU usage stays the same all the time. note that i used the "watch" method in both cases (with and without IDE) to measure usage.
sure, i'll send the patch.