clang-cl cannot support function targets #53520
This is indeed a nasty divergence from MSVC, and one that, IIRC, is actually quite hard to fix, as it's also an issue in LLVM, not just Clang. There is a workaround, however, as you can use the
That won't help. The problem as I see it is that immintrin.h does not include non-current subarch targets when _MSC_VER is defined. I think if that was removed it still wouldn't work as MSVC does, but it would at least work in functions with an appropriate target function attribute (clang/gcc style).
I am not quite sure how my above suggestion differs from clang/gcc style. Using the GCC style:

```c
#include <immintrin.h>

int main()
{
    unsigned int value;
    _rdrand32_step(&value);
}
```

and then compiled it using But yes, this won't match MSVC behaviour, which is clearly a bug. The above would simply serve as a workaround.
I think we are talking past each other. I am talking about using intrinsics without having similar command-line flags, for runtime target detection. MSVC style:
clang/gcc style:
This is for runtime detection of CPU features, not anything enabled at compile time. In MSVC, non-target intrinsics always work (somehow); in clang/gcc, they work if an appropriate function target has been declared where they are used. In clang-cl neither works, because the compile-time flags are checked before the functions are defined in the immintrin.h header.
Please fix this bug in clang-cl; this issue is really annoying.
I think this has already been explained by Allan, but just to make sure it's clear: this bug is preventing us from writing a single Windows binary that can be optimized for various architectures. For instance, we'd like the binary to work on older CPUs that don't have AVX, and on newer machines with AVX we can run AVX-optimized functions instead of the generic SSE2 functions. We might even have optimizations for AVX-512. Fixing this can give applications a 4x speedup on Windows while still allowing the application to run on older CPUs. It's not every day a compiler can do something that gives a 4x speedup, so I think this should be a high priority to fix.
@ThiagoIze is right: this is performance-critical for many applications, so please prioritize this issue.
Any progress on this? |
🤔 |
Does anyone know what the limitation is? If I copy the definitions from avxintrin.h:

```c
#define __DEFAULT_FN_ATTRS __attribute__((__always_inline__, __nodebug__, __target__("avx"), __min_vector_width__(256)))

typedef int __v8si __attribute__ ((__vector_size__ (32)));
typedef long long __m256i __attribute__((__vector_size__(32), __aligned__(32)));
typedef long long __m256i_u __attribute__((__vector_size__(32), __aligned__(1)));

static __inline __m256i __DEFAULT_FN_ATTRS
_mm256_set_epi32(int __i0, int __i1, int __i2, int __i3,
                 int __i4, int __i5, int __i6, int __i7)
{
  return __extension__ (__m256i)(__v8si){ __i7, __i6, __i5, __i4, __i3, __i2, __i1, __i0 };
}

static __inline __m256i __DEFAULT_FN_ATTRS
_mm256_set1_epi32(int __i)
{
  return _mm256_set_epi32(__i, __i, __i, __i, __i, __i, __i, __i);
}

static __inline void __DEFAULT_FN_ATTRS
_mm256_storeu_si256(__m256i_u *__p, __m256i __a)
{
  struct __storeu_si256 {
    __m256i_u __v;
  } __attribute__((__packed__, __may_alias__));
  ((struct __storeu_si256*)__p)->__v = __a;
}

__attribute__((target("avx2"))) void fill(void *ptr, int n)
{
  __m256i v = _mm256_set1_epi32(n);
  _mm256_storeu_si256((__m256i *)ptr, v);
}
```

it compiles and works just fine (LLVM 15.0.6).
clang-cl's intrinsics support is broken: it doesn't declare the AVX2 intrinsics if they are disabled, and this doesn't match GCC or MSVC behavior: llvm/llvm-project#53520 This fix allows disabling x86 intrinsics during configuration of a clang-cl build. A clang-cl build is still not guaranteed to work with x86 intrinsics enabled. Change-Id: Icd295f6b4d868adf10bcd425d5280c56b43cb9f7 Reviewed-by: Thiago Macieira <[email protected]>
I've encountered another similar issue, which prevents me from building Qt using clang-cl: https://bugreports.qt.io/browse/QTBUG-113231. However, I've fixed it, and the workaround is really simple. For some unknown reason, clang-cl can't find some intrin functions, so I modified immintrin.h to let the corresponding headers be included unconditionally. And then the compilation error is solved. I'm not sure whether what I'm doing is correct or not, but at least the compilation now goes smoothly without any errors, and the generated binary also seems to work fine.
No, I suspect that is the correct solution. The stupid include breaks are the issue, and it makes no sense that they are there. Just nobody in the project has tried doing it. So it probably takes somebody outside of LLVM to fix this bug.
Those were added for a reason. The question is whether that reason is still valid. I suspect it isn't: the reason must have been that the
These header guards were added in 379a195, which has been included since LLVM 3.9.0 (2016).
They were added because including this header without them induces ~10-30% compile time overhead, which you often have no say in because it's included by system headers. CC @nico for awareness |
I'd prefer to pay the penalty of 10 to 30% slowness over not being able to compile code that non-CL Clang and MSVC compile.
It's something we need to solve, but it's not acceptable to introduce that amount of compile time regression when solving it. |
I agree it's something to solve, but disagree that the cost is unacceptable. As I said, this is the difference between "good compiler generates really good code" (hopefully) and "broken compiler, don't even report bugs to us". If there is a good chance that the compile-time slowness will get sufficiently solved in the short term, then the delay is acceptable. Conversely, if there's no chance of that happening soon (too difficult, no one working on it, etc.), then an indefinite delay is not acceptable.

I also don't know of much code that includes immintrin.h and family in public headers. They're usually kept in private headers used exclusively for implementations, so the cost in compilation time is reduced. More importantly, those are also the ones that want to use the header and right now can't.
According to this comment by @StephanTLavavej Microsoft/STL, immintrin.h is included by intrin.h, which seems to be included in a core STL header. If I'm reading that right, every TU that uses the C++ standard library on Windows includes this header? That seems to suggest that a performance impact as high as 10-30% would be quite unacceptable... Unless I'm misreading the STL code, of course. 😅 |
Ah, I see. They're probably using intrinsics for and similar functionality. I think I've seen other uses too in their headers, like in https://github.com/microsoft/STL/blob/091cad2eaaa5bc25873eb7261cae57ab123592f3/stl/inc/bit#L144-L145. libstdc++ and libc++ usually use the Anyway, this does mean the impact of changing immintrin.h is much higher than I'd thought. |
Yup, that's why I was saying the slowdown was not acceptable -- it impacts roughly everything compiled on Windows, which makes this tricky to resolve. That said, I think we need a solution of some kind. |
BTW, do you know if the slowdown is caused by the presence of the |
It's the size of the header file, I believe.
@rnk as well. |
Should this issue have a bug label? After all, the behavior does not correspond to the expected behavior, and it prevents the use of clang-cl under Windows for various libraries (Qt, for example).
This is still broken in mainline clang.
I just ran into this when porting our internal projects to clang-cl. Also, not all projects use the STL, and some actively avoid vendor STLs for one of the reasons outlined above: the insane amount of header includes. In the interim I can work around it by using the

Anyway, when I am free this weekend, assuming there isn't already a PR up to fix this, I am going to get one up to try to push a solution forward that satisfies the concerns around the MSVC STL and users who need to be able to do run-time detection of CPU support for SSE/AVX.
We should revisit this; it is unfortunate that the only way to use Intel intrinsics with clang-cl is to add additional command-line flags. Both GCC and MSVC can call these intrinsics with only local source changes, either via target attributes or simply by directly calling various AVX intrinsics. I believe Intel's last proposal for addressing the compile-time concerns was to ship a module map for Clang builtin intrinsic headers, but I think that hasn't advanced because folks are concerned about establishing a hard dependency on Clang header modules. They interfere with pre-processing, crash reproduction, and distributed build systems, and are not entirely aligned with C++ standard modules. Perhaps another avenue for addressing the compile-time concerns would be to go down the path of providing an intrin0.h file similar to MSVC, which declares the minimal set of intrinsics that the MSVC STL needs, and then we could allow
That sounds good to me. I'll try to give it a whirl this weekend and get a PR up if someone else doesn't beat me to it :).
This is actually a major compile-time cost for us. Includes headers including all previous headers, such as

This is especially true in games, where a lot of these headers end up being included from your math library, which is mostly header files to ensure functions have a chance to be inlined, and which basically bloats every source file. I am in the process of writing our own SSE headers internally to combat some of this, but the fewer platform/toolchain-specific ifdefs required, the better, in my opinion.

The PR I was going to get up would allow users to include clang-specific isolated headers such as

The git blame for https://github.com/llvm/llvm-project/blob/main/clang/lib/Headers/avxintrin.h#L11 shows that these checks were added because gcc has this behaviour. My 2c.
I would prefer it if we could use different defines or compile-time flags. In a project supporting multiple compilers, that is preferable to having different includes. Could we just have a define that unlocks non-target intrinsics? -D_CLANG_CL_INTRINSICS?
…for clang-cl (#75711) Fixes #53520.

#### Description

Provide `intrin0.h` to be the minimal set of intrinsics that the MSVC STL requires. The `intrin0.h` header matches the latest header provided by MSVC 1939, which does include some extra intrinsics that the MSVC STL does not use. Inside `BuiltinHeaders.def` I kept the header description as `intrin.h`. If you want me to change those to `intrin0.h` for the moved intrinsics, let me know.

This should now allow `immintrin.h` to be used with function targets for runtime CPU detection of SIMD instruction sets without worrying about the compile-time overhead from the MSVC STL including `intrin.h` on clang. I still need to figure out how best to update the MSVC STL to detect the presence of `intrin0.h` from clang and to use this header over `intrin.h`.

#### Testing

Built clang locally and ran the test suite. I still need to do a pass over the existing unit tests for the MS intrinsics to make sure there aren't any gaps. Wanted to get this PR up for discussion first.

Modified the latest MSVC STL from GitHub to point to `intrin0.h` for clang. Wrote some test files that included MSVC STL headers that rely on intrinsics, such as `atomic`, `bit` and `vector`. Built the unit tests against x86, arm, aarch64, and x64.

#### Benchmarks

The following include times are based on the x64 target with the modified headers in this PR. These timings were done by using `clang-cl.exe -ftime-trace` and taking the wall time for parsing `intrin.h` and `intrin0.h`. `intrin.h` takes ~897ms to parse; `intrin0.h` takes ~1ms to parse.

If anything is required or a different approach is preferred, let me know. I would very much like to move this over the finish line so we can use function targets with clang-cl.
The immintrin.h header has a bug where it does not include sub-arch intrinsics headers if _MSC_VER is defined, unless the sub-arch target is the current one.
This is inconsistent with MSVC, which always defines intrinsics regardless of the active arch, and is also inconsistent with normal clang (gcc-style), which defines them for use with sub-arch function targets. The result is that there is no way of using intrinsics for sub-arch targets with clang-cl, since neither the gcc nor the MSVC style works.
This has forced us to disable many optimizations in Qt with clang-cl; see https://bugreports.qt.io/browse/QTBUG-88081, https://bugreports.qt.io/browse/QTBUG-88434 and https://bugreports.qt.io/browse/QTBUG-98253
I suggest at least allowing the gcc style, with the intrinsics working in target-attributed functions, if you can't support the MSVC style.