AndyCap I thought the original issue was that 3.9 wasn't aligning to 16?
The original issue is that on clang-3.9, alignof(float32x4_t) is 16, and the compiler (probably) generates assembly code that assumes the memory is aligned. If the memory is not actually aligned to a 16-byte boundary, you get a SIGBUS.
However, when allocating with operator new an object of a class that contains one or more float32x4_t members, there is no guarantee that it will actually be allocated with the appropriate alignment. The underlying issue is that, before C++17, the C++ standard only requires the pointer returned by operator new to be suitably aligned for types with fundamental alignment (at most alignof(std::max_align_t)); nothing is guaranteed for a type that needs more than that. Hence the workaround above with a custom allocator. It seems that C++17 offers better support for aligned heap allocation, but I didn't get into the details of that.
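To illustrate the kind of workaround meant here, this is a minimal sketch of a class-specific operator new that asks for 16-byte-aligned storage explicitly. It is not the allocator from the earlier comment: Vec4Holder is a hypothetical stand-in for a class holding float32x4_t members (so that it compiles without arm_neon.h), is_aligned_16 and demo_heap_alignment are illustrative helpers, and posix_memalign is assumed to be available (it is POSIX, not standard C++).

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstdlib>  // posix_memalign (POSIX), std::free
#include <new>      // std::bad_alloc

// Hypothetical stand-in for a class that contains float32x4_t members.
struct alignas(16) Vec4Holder {
    float lanes[4];

    // Class-specific operator new: request 16-byte-aligned storage
    // instead of relying on the default global operator new.
    static void* operator new(std::size_t size) {
        void* p = nullptr;
        if (posix_memalign(&p, alignof(Vec4Holder), size) != 0) {
            throw std::bad_alloc();
        }
        return p;
    }

    // Memory obtained from posix_memalign is released with free.
    static void operator delete(void* p) noexcept { std::free(p); }
};

// Returns true when the pointer sits on a 16-byte boundary.
inline bool is_aligned_16(const void* p) {
    return reinterpret_cast<std::uintptr_t>(p) % 16 == 0;
}

// Usage sketch: allocate one object on the heap and check its address.
inline bool demo_heap_alignment() {
    Vec4Holder* h = new Vec4Holder();
    const bool ok = is_aligned_16(h);
    assert(ok && "heap object is not 16-byte aligned");
    delete h;
    return ok;
}
```

This only covers single objects; array new and standard containers would need their own handling. With C++17, a new-expression for an over-aligned type passes the alignment to operator new(std::size_t, std::align_val_t), which should make this kind of override unnecessary.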
What is surprising is that while alignof(float32x4_t) is 16 bytes on clang-3.9, it is actually 8 bytes on gcc-6.3 and clang-6.0. I guess that this could explain both:
- the performance penalty for gcc-6.3 vs clang-3.9 when using intrinsics
- the lack of SIGBUS on gcc-6.3: the generated assembly does not make assumptions about alignment that are then violated at runtime.
Seeing how alignof(float32x4_t) has become 8 bytes in clang-6.0, I am afraid this may also come with a performance penalty (but I haven't verified it yet). However, I am wondering where that difference comes from, as the arm_neon.h files for the two versions of clang provide identical definitions for all the NEON vector types.
Incidentally, given that a change in alignof can also change the size of structs and classes, this suggests that header files for libraries should abstract away such classes and structs, so that the library works fine even if the compiler used to build it is different from the one used to build the user code that uses it.
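The layout effect can be sketched without NEON at all. Below, Vec8 and Vec16 are hypothetical stand-ins for the same 16-byte vector type as seen by a compiler that gives it 8-byte alignment and one that gives it 16-byte alignment; Holder8 and Holder16 are the same user struct compiled against each. The static_asserts assume a typical ABI where char has size 1 and float has size 4.

```cpp
#include <cstddef>  // offsetof

// Same 16 bytes of payload, different alignment requirement.
struct alignas(8)  Vec8  { float lanes[4]; };  // alignof == 8
struct alignas(16) Vec16 { float lanes[4]; };  // alignof == 16

// The "same" struct as each compiler would lay it out.
struct Holder8  { char tag; Vec8  v; };
struct Holder16 { char tag; Vec16 v; };

static_assert(sizeof(Vec8) == 16 && sizeof(Vec16) == 16,
              "both vector stand-ins hold 16 bytes");

// The member offset and the total size both depend on the alignment.
static_assert(offsetof(Holder8, v) == 8,   "v follows 7 bytes of padding");
static_assert(offsetof(Holder16, v) == 16, "v follows 15 bytes of padding");
static_assert(sizeof(Holder8) == 24,  "8-byte alignment: 24 bytes");
static_assert(sizeof(Holder16) == 32, "16-byte alignment: 32 bytes");
```

If a library and its user disagree on which of these two layouts applies, they read the member v at different offsets, which is why keeping such types out of the public headers (for example behind an opaque pointer) seems safer.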