20%+ regression in ReleaseFast performance since 0.11.0 #17768

scheibo · 2023-10-28T17:46:37Z

Zig Version

0.12.0-dev.876+aaf46187a

Steps to Reproduce and Observed Behavior

I noticed a regression on my project's benchmark and bisected it via nightlies - zig-macos-aarch64-0.12.0-dev.866+3a47bc715 produces code that runs ~10% faster than zig-macos-aarch64-0.12.0-dev.876+aaf46187a:

$ git clone https://github.com/pkmn/engine.git
<snip>
$ cd engine
$ git reset --hard 12e7292c9df3ad73c9681205fb3acfee1536c425
$ zig version
0.12.0-dev.866+3a47bc715
$ zig build benchmark --  1 1000/10000 281483566841860
122294722,1232901,1471351722788935846

# change zig versions
$ zig version
0.12.0-dev.876+aaf46187a
$ zig build benchmark --  1 1000/10000 281483566841860
137394552,1232901,1471351722788935846

The benchmark tool uses std.time.Timer to perform its own internal timing which it prints out as the first number there (second two numbers are for confirming the benchmarks are computing the same results).

Alternatively, since the regression is large enough you can literally just use time (though this is measuring a different thing than the internal benchmark timer, but thats kind of unimportant from Zig's POV):

$ zig version
0.12.0-dev.866+3a47bc715
$ zig build --verbose benchmark --  1 1000/10000 281483566841860
<grab actual binary from output to run>
$ time ./zig-cache/o/7cb294da313404bb8f51f4c304d47793/benchmark 1 1000/10000 281483566841860
124231503,1232901,1471351722788935846

real	0m0.178s
user	0m0.168s
sys	0m0.005s

# change zig versions
$ zig version
0.12.0-dev.876+aaf46187a
$ zig build --verbose benchmark --  1 1000/10000 281483566841860
<grab actual binary from output to run>
$ time ./zig-cache/o/30082df782561e9cee72e116248d663a/benchmark 1 1000/10000 281483566841860
136831126,1232901,1471351722788935846

real	0m0.198s
user	0m0.182s
sys	0m0.006s

Expected Behavior

There not to be a regression :)

From 3a47bc7...aaf4618 I'm guessing #17391 is a likely suspect?

The text was updated successfully, but these errors were encountered:

xxxbxxx · 2023-10-29T07:58:17Z

mmm interresting.

looks like reverting the change does indeed restore the perf.

--- a/src/codegen/llvm.zig
+++ b/src/codegen/llvm.zig
@@ -10528,7 +10528,8 @@ pub const FuncGen = struct {
             if (isByRef(elem_ty, mod)) {
                 return self.loadByRef(ptr, elem_ty, ptr_alignment, access_kind);
             }
-            return self.loadTruncate(access_kind, elem_ty, ptr, ptr_alignment);
+            //return self.loadTruncate(access_kind, elem_ty, ptr, ptr_alignment);
+            return self.wip.load(access_kind, try o.lowerType(elem_ty), ptr, ptr_alignment, "");
         }
 
         const containing_int_ty = try o.builder.intType(@intCast(info.packed_offset.host_size * 8));

a quick look with perf report shows the main difference comes from
gen1.data.Battle(common.rng.PRNG(1)).choice that uses a u4 loop index.

            var slot: u4 = 2;
            while (slot <= 6) : (slot += 1) {
                const id = side.order[slot - 1];
                if (id == 0 or side.pokemon[id - 1].hp == 0) continue;
                out[n] = .{ .type = .Switch, .data = slot };
                n += 1;
            }

The loop (and others in the function) was previously unrolled by llvm, and no longer is.

xxxbxxx · 2023-10-29T18:47:53Z

I've tried to poke at it a little bit, but no idea how to fix this.

Short of doing some kind of range propagation pass or something, I'm not quite sure how it is possible distinguish between "this is a nice clean local variable" and "this may contain some uninitialized left over bits" (as in #14200)

(but then, I know nothing about llvm or zig internals, so maybe there's a way...)

Of course, as a workaround, changing the loop counter to u8 (and adding a few @intCast()) removes the truncations on the loop index and allows llvm heuristics to work and unroll/simplify the code.

scheibo · 2023-10-29T18:52:15Z

Thanks for digging in, @xxxbxxx! Its nice to know that I can work around this with @intCast and different loop counter types throughout the code, though I'm a little worried in general about the performance cliff/footgun so I don't know what Zig wants to do here

scheibo · 2023-11-25T20:07:29Z

Of course, as a workaround, changing the loop counter to u8 (and adding a few @intCast()) removes the truncations on the loop index and allows llvm heuristics to work and unroll/simplify the code.

Changing all non-power-of-2 loop counters to u8 or usize and then adding @intCast()s solved half of the problem (i.e. this is now a 5% regression instead of a 10% regression), but im still stuck with a sizable regression.

I'm also confused as to why #17391 is problematic if its supposed to only be extending a change that didn't cause any regression to "optional and unions", and my project does not use unions and has no non-byte-sized optionals.

xxxbxxx · 2023-11-26T21:00:35Z

I'm also confused as to why #17391 is problematic if its supposed to only be extending a change that didn't cause any regression to "optional and unions", and my project does not use unions and has no non-byte-sized optionals.

that's a misunderstanding: the "change that didn't cause any regression" is indeed the change triggering the performance issue...

forcing llvm to truncate the padding bits prevents some optimizations. fixes 370662c see ziglang#17768

Inqnuam · 2024-05-27T17:28:14Z

Some notable regressions here 👇
~24% regression when comparing zig 11 vs 12/13 with ReleaseFast
~33% regression when comparing zig 11 vs 12/13 with ReleaseSmall

macOS: 13.6.6
CPU: Intel Core i9-9900K

zigV build-lib --gc-sections -fsingle-threaded -dead_strip -dead_strip_dylibs -mcpu=native -OReleaseMODE -dynamic -lc  src/lib.zig -fallow-shlib-undefined -femit-bin=dist/lib-V-MODE

ZIG 11 Fast x 20,166 ops/sec ±0.35% (96 runs sampled)
ZIG 11 Small x 14,645 ops/sec ±0.39% (95 runs sampled)
ZIG 12 Fast x 16,136 ops/sec ±0.36% (97 runs sampled)
ZIG 12 Small x 10,988 ops/sec ±0.33% (94 runs sampled)
ZIG 13 Fast x 16,354 ops/sec ±0.33% (96 runs sampled)
ZIG 13 Small x 11,420 ops/sec ±0.37% (97 runs sampled)

🚀 Fastest: ZIG 11 Fast
🐌 Slowest: ZIG 12 Small

Would love to have zig 13 Fast's size with zig 11 Fast's speed 😄

forcing llvm to truncate the padding bits prevents some optimizations. fixes 370662c see ziglang#17768

per ziglang/zig#17768 (comment) this removes truncation on the loop index and allows LLVM to better simplify the code. This claws back half of the 10% regression, though there is still a 5% gap.

forcing llvm to truncate the padding bits prevents some optimizations. fixes 370662c see ziglang#17768

Required as a workaround for ziglang/zig#17768 - currently the performance is around 20% slower on master which is unacceptable. This is obviously somewhat of a bummer (RIP 3a62fd3) and will be a pretty big maintenance headache but its been almost a year without (positive) progress on the Zig side so there's not really any better options. Clean `make integration` on the following builds: 0.11.0 0.12.0 0.12.0-dev.789+e6590fea1 0.12.0-dev.866+3a47bc715 0.12.0-dev.876+aaf46187a 0.12.0-dev.1396+f6de3ec96 0.12.0-dev.1879+e19219fa0 0.12.0-dev.2036+fc79b22a9 0.12.0-dev.2665+919a3bae1 0.12.0-dev.3644+05d975576 0.12.1 0.13.0 0.13.0-dev.8+c352845e8 0.13.0-dev.39+f6f7a47aa 0.13.0-dev.46+3648d7df1 0.14.0-dev.23+d9bd34fd0 0.14.0-dev.564+75cf7fca9 0.14.0-dev.994+9f46abf59

scheibo · 2024-08-29T22:35:54Z

I just spent the day reverting my project to 0.11.0 - as noticed by @Inqnuam I'm now actually seeing 20%+ regression, not just 10% (and @xxxbxxx's suggestion above) no longer seems to move the needle much at all.

My project builds with the Zig compiler at HEAD and all the way back to 0.11.0. I don't know how long that will possible in the wake of breaking language changes (it seems like only breaking standard library and build system changes have occurred since 0.11.0), but currently I feel it would serve as a uniquely suitable testbed for someone interesting in attempting to improve compiler performance or the performance of compiled output. I would be very surprised if other projects haven't also experienced at least some sort of slowdown since 0.11.0, I just imagine such a slowdown is harder to attribute directly to Zig in the same way my project is able to due to the difficulties of supporting multiple Zig versions and how much project's code and feature set would usually change over time.

Required as a workaround for ziglang/zig#17768 - currently the performance is around 20% slower on master which is unacceptable. This is obviously somewhat of a bummer (RIP 3a62fd3) and will be a pretty big maintenance headache but its been almost a year without (positive) progress on the Zig side so there's not really any better options. Clean `make integration` on the following builds: 0.11.0 0.12.0 0.12.0-dev.789+e6590fea1 0.12.0-dev.866+3a47bc715 0.12.0-dev.876+aaf46187a 0.12.0-dev.1396+f6de3ec96 0.12.0-dev.1879+e19219fa0 0.12.0-dev.2036+fc79b22a9 0.12.0-dev.2665+919a3bae1 0.12.0-dev.3644+05d975576 0.12.1 0.13.0 0.13.0-dev.8+c352845e8 0.13.0-dev.39+f6f7a47aa 0.13.0-dev.46+3648d7df1 0.14.0-dev.23+d9bd34fd0 0.14.0-dev.564+75cf7fca9 0.14.0-dev.994+9f46abf59

scheibo added the bug Observed behavior contradicts documented or intended behavior label Oct 28, 2023

andrewrk added optimization backend-llvm The LLVM backend outputs an LLVM IR Module. regression It worked in a previous version of Zig, but stopped working. and removed bug Observed behavior contradicts documented or intended behavior labels Oct 28, 2023

andrewrk added this to the 0.12.0 milestone Oct 28, 2023

andrewrk mentioned this issue Oct 28, 2023

codegen/llvm: truncate padding bits when loading a non-byte-sized value #17391

Merged

xxxbxxx added a commit to xxxbxxx/zig that referenced this issue Nov 26, 2023

codegen/llvm: improve loadTruncate() to use a simple load when possible

319d1c2

forcing llvm to truncate the padding bits prevents some optimizations. fixes 370662c see ziglang#17768

xxxbxxx mentioned this issue Nov 26, 2023

codegen/llvm: improve loadTruncate() to use a simple load when possible #18137

Closed

andrewrk modified the milestones: 0.12.0, 0.13.0 Mar 20, 2024

andrewrk pushed a commit to xxxbxxx/zig that referenced this issue May 9, 2024

codegen/llvm: improve loadTruncate() to use a simple load when possible

5a99ca7

forcing llvm to truncate the padding bits prevents some optimizations. fixes 370662c see ziglang#17768

xxxbxxx added a commit to xxxbxxx/zig that referenced this issue Aug 24, 2024

codegen/llvm: improve loadTruncate() to use a simple load when possible

1ce1723

forcing llvm to truncate the padding bits prevents some optimizations. fixes 370662c see ziglang#17768

xxxbxxx mentioned this issue Aug 24, 2024

Suboptimal codegen for integer variables whose bit width isn't a multiple of 8 #21121

Open

xxxbxxx added a commit to xxxbxxx/zig that referenced this issue Aug 27, 2024

codegen/llvm: improve loadTruncate() to use a simple load when possible

70dc6fa

forcing llvm to truncate the padding bits prevents some optimizations. fixes 370662c see ziglang#17768

scheibo changed the title ~~10% regression in ReleaseFast performance introduced in 0.12.0-dev.876+aaf46187a~~ 20%+ regression in ReleaseFast performance since 0.11.0 Aug 29, 2024

scheibo mentioned this issue Nov 4, 2024

Packed struct operations are not emitted as operations on their backing integer #21891

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

20%+ regression in ReleaseFast performance since 0.11.0 #17768

20%+ regression in ReleaseFast performance since 0.11.0 #17768

scheibo commented Oct 28, 2023 •

edited

Loading

xxxbxxx commented Oct 29, 2023

xxxbxxx commented Oct 29, 2023

scheibo commented Oct 29, 2023

scheibo commented Nov 25, 2023 •

edited

Loading

xxxbxxx commented Nov 26, 2023

Inqnuam commented May 27, 2024

scheibo commented Aug 29, 2024

20%+ regression in ReleaseFast performance since 0.11.0 #17768

20%+ regression in ReleaseFast performance since 0.11.0 #17768

Comments

scheibo commented Oct 28, 2023 • edited Loading

Zig Version

Steps to Reproduce and Observed Behavior

Expected Behavior

xxxbxxx commented Oct 29, 2023

xxxbxxx commented Oct 29, 2023

scheibo commented Oct 29, 2023

scheibo commented Nov 25, 2023 • edited Loading

xxxbxxx commented Nov 26, 2023

Inqnuam commented May 27, 2024

scheibo commented Aug 29, 2024

scheibo commented Oct 28, 2023 •

edited

Loading

scheibo commented Nov 25, 2023 •

edited

Loading