-
Notifications
You must be signed in to change notification settings - Fork 1.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Optimize reverse for 32-bit trivial types (optimize for all Intel and recent AMD, pessimize for Excavator/Zen/Zen+) #2383
Conversation
Since the optimization is certainly dependent on CPU microarchitecture, I've looked into https://uops.info data to see the results for different CPUs. I was primarily looking into Throughput (TP). This appeared to be a significant win for Intel Skylake and earlier, a smaller win for Intel Ice Lake and later. Out of AMD, upos.info only has Zen+, Zen 2, Zen 3. It appears to be some loss for Zen+ while still a small win for Zen 2 and Zen 3. @CaseyCarter confirmed a small win for Zen 2 (3950X) and @StephanTLavavej confirmed a small win for Zen 3 (5950X) |
I'm now also suspecting that original code might be not specifically optimized for some CPU that for which it is more efficient. Rather it may be a simple upscaling of SSE code to AVX. In this case the existing 32-bit implementation makes sense. |
Looks good to me, thanks. The resulting code is both shorter and simpler to understand (as the intrinsic directly does what we want). I confirmed that this intrinsic is AVX2 so it's correctly guarded. The performance results you've collected are compelling - while of course it would be nice to have a pure win across the board, having wins for all modern processors (2019+) and slight losses for older processors is acceptable. (Especially considering that not too long ago, we shipped completely unvectorized |
whew, sorry it took so long. I got tangled up in the AVX2 documentation several times, this looks equivalent to the old code. The mask parameter could be "constexpr" but I think it's merely const intentionally, as the intent is to use an indirect version of the instruction (without an immediate), at least that's my understanding |
No, just an oversight. Could be Please push the change if you think ith worth resetting testing. |
Actually disregard that. This one is also |
I'm mirroring this to the MSVC-internal repo - please notify me if any further changes are pushed. |
😻 🎉 😹 !uoy knahT |
change vpermq+vpshufd to vpermd
(see https://uops.info/table.html)
Benchmark: