Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

20%+ regression in ReleaseFast performance since 0.11.0 #17768

Open
scheibo opened this issue Oct 28, 2023 · 7 comments
Open

20%+ regression in ReleaseFast performance since 0.11.0 #17768

scheibo opened this issue Oct 28, 2023 · 7 comments
Labels
backend-llvm The LLVM backend outputs an LLVM IR Module. optimization regression It worked in a previous version of Zig, but stopped working.
Milestone

Comments

@scheibo
Copy link
Contributor

scheibo commented Oct 28, 2023

Zig Version

0.12.0-dev.876+aaf46187a

Steps to Reproduce and Observed Behavior

I noticed a regression on my project's benchmark and bisected it via nightlies - zig-macos-aarch64-0.12.0-dev.866+3a47bc715 produces code that runs ~10% faster than zig-macos-aarch64-0.12.0-dev.876+aaf46187a:

$ git clone https://github.com/pkmn/engine.git
<snip>
$ cd engine
$ git reset --hard 12e7292c9df3ad73c9681205fb3acfee1536c425
$ zig version
0.12.0-dev.866+3a47bc715
$ zig build benchmark --  1 1000/10000 281483566841860
122294722,1232901,1471351722788935846

# change zig versions
$ zig version
0.12.0-dev.876+aaf46187a
$ zig build benchmark --  1 1000/10000 281483566841860
137394552,1232901,1471351722788935846

The benchmark tool uses std.time.Timer to perform its own internal timing which it prints out as the first number there (second two numbers are for confirming the benchmarks are computing the same results).

Alternatively, since the regression is large enough you can literally just use time (though this is measuring a different thing than the internal benchmark timer, but thats kind of unimportant from Zig's POV):

$ zig version
0.12.0-dev.866+3a47bc715
$ zig build --verbose benchmark --  1 1000/10000 281483566841860
<grab actual binary from output to run>
$ time ./zig-cache/o/7cb294da313404bb8f51f4c304d47793/benchmark 1 1000/10000 281483566841860
124231503,1232901,1471351722788935846

real	0m0.178s
user	0m0.168s
sys	0m0.005s

# change zig versions
$ zig version
0.12.0-dev.876+aaf46187a
$ zig build --verbose benchmark --  1 1000/10000 281483566841860
<grab actual binary from output to run>
$ time ./zig-cache/o/30082df782561e9cee72e116248d663a/benchmark 1 1000/10000 281483566841860
136831126,1232901,1471351722788935846

real	0m0.198s
user	0m0.182s
sys	0m0.006s

Expected Behavior

There not to be a regression :)

From 3a47bc7...aaf4618 I'm guessing #17391 is a likely suspect?

@scheibo scheibo added the bug Observed behavior contradicts documented or intended behavior label Oct 28, 2023
@andrewrk andrewrk added optimization backend-llvm The LLVM backend outputs an LLVM IR Module. regression It worked in a previous version of Zig, but stopped working. and removed bug Observed behavior contradicts documented or intended behavior labels Oct 28, 2023
@andrewrk andrewrk added this to the 0.12.0 milestone Oct 28, 2023
@xxxbxxx
Copy link
Contributor

xxxbxxx commented Oct 29, 2023

mmm interresting.

looks like reverting the change does indeed restore the perf.

--- a/src/codegen/llvm.zig
+++ b/src/codegen/llvm.zig
@@ -10528,7 +10528,8 @@ pub const FuncGen = struct {
             if (isByRef(elem_ty, mod)) {
                 return self.loadByRef(ptr, elem_ty, ptr_alignment, access_kind);
             }
-            return self.loadTruncate(access_kind, elem_ty, ptr, ptr_alignment);
+            //return self.loadTruncate(access_kind, elem_ty, ptr, ptr_alignment);
+            return self.wip.load(access_kind, try o.lowerType(elem_ty), ptr, ptr_alignment, "");
         }
 
         const containing_int_ty = try o.builder.intType(@intCast(info.packed_offset.host_size * 8));

a quick look with perf report shows the main difference comes from
gen1.data.Battle(common.rng.PRNG(1)).choice that uses a u4 loop index.

            var slot: u4 = 2;
            while (slot <= 6) : (slot += 1) {
                const id = side.order[slot - 1];
                if (id == 0 or side.pokemon[id - 1].hp == 0) continue;
                out[n] = .{ .type = .Switch, .data = slot };
                n += 1;
            }

The loop (and others in the function) was previously unrolled by llvm, and no longer is.

@xxxbxxx
Copy link
Contributor

xxxbxxx commented Oct 29, 2023

I've tried to poke at it a little bit, but no idea how to fix this.

Short of doing some kind of range propagation pass or something, I'm not quite sure how it is possible distinguish between "this is a nice clean local variable" and "this may contain some uninitialized left over bits" (as in #14200)

(but then, I know nothing about llvm or zig internals, so maybe there's a way...)

Of course, as a workaround, changing the loop counter to u8 (and adding a few @intCast()) removes the truncations on the loop index and allows llvm heuristics to work and unroll/simplify the code.

@scheibo
Copy link
Contributor Author

scheibo commented Oct 29, 2023

Thanks for digging in, @xxxbxxx! Its nice to know that I can work around this with @intCast and different loop counter types throughout the code, though I'm a little worried in general about the performance cliff/footgun so I don't know what Zig wants to do here

@scheibo
Copy link
Contributor Author

scheibo commented Nov 25, 2023

Of course, as a workaround, changing the loop counter to u8 (and adding a few @intCast()) removes the truncations on the loop index and allows llvm heuristics to work and unroll/simplify the code.

Changing all non-power-of-2 loop counters to u8 or usize and then adding @intCast()s solved half of the problem (i.e. this is now a 5% regression instead of a 10% regression), but im still stuck with a sizable regression.

I'm also confused as to why #17391 is problematic if its supposed to only be extending a change that didn't cause any regression to "optional and unions", and my project does not use unions and has no non-byte-sized optionals.

@xxxbxxx
Copy link
Contributor

xxxbxxx commented Nov 26, 2023

I'm also confused as to why #17391 is problematic if its supposed to only be extending a change that didn't cause any regression to "optional and unions", and my project does not use unions and has no non-byte-sized optionals.

that's a misunderstanding: the "change that didn't cause any regression" is indeed the change triggering the performance issue...

xxxbxxx added a commit to xxxbxxx/zig that referenced this issue Nov 26, 2023
forcing llvm to truncate the padding bits prevents some optimizations.

fixes 370662c
see ziglang#17768
@andrewrk andrewrk modified the milestones: 0.12.0, 0.13.0 Mar 20, 2024
andrewrk pushed a commit to xxxbxxx/zig that referenced this issue May 9, 2024
forcing llvm to truncate the padding bits prevents some optimizations.

fixes 370662c
see ziglang#17768
@Inqnuam
Copy link

Inqnuam commented May 27, 2024

Some notable regressions here 👇
~24% regression when comparing zig 11 vs 12/13 with ReleaseFast
~33% regression when comparing zig 11 vs 12/13 with ReleaseSmall

macOS: 13.6.6
CPU: Intel Core i9-9900K
zigV build-lib --gc-sections -fsingle-threaded -dead_strip -dead_strip_dylibs -mcpu=native -OReleaseMODE -dynamic -lc  src/lib.zig -fallow-shlib-undefined -femit-bin=dist/lib-V-MODE
Screenshot 2024-05-27 at 18 54 25
ZIG 11 Fast x 20,166 ops/sec ±0.35% (96 runs sampled)
ZIG 11 Small x 14,645 ops/sec ±0.39% (95 runs sampled)
ZIG 12 Fast x 16,136 ops/sec ±0.36% (97 runs sampled)
ZIG 12 Small x 10,988 ops/sec ±0.33% (94 runs sampled)
ZIG 13 Fast x 16,354 ops/sec ±0.33% (96 runs sampled)
ZIG 13 Small x 11,420 ops/sec ±0.37% (97 runs sampled)

🚀 Fastest: ZIG 11 Fast
🐌 Slowest: ZIG 12 Small

Would love to have zig 13 Fast's size with zig 11 Fast's speed 😄

xxxbxxx added a commit to xxxbxxx/zig that referenced this issue Aug 24, 2024
forcing llvm to truncate the padding bits prevents some optimizations.

fixes 370662c
see ziglang#17768
scheibo added a commit to pkmn/engine that referenced this issue Aug 24, 2024
per
ziglang/zig#17768 (comment)
this removes truncation on the loop index and allows LLVM to better
simplify the code. This claws back half of the 10% regression,
though there is still a 5% gap.
scheibo added a commit to pkmn/engine that referenced this issue Aug 24, 2024
per
ziglang/zig#17768 (comment)
this removes truncation on the loop index and allows LLVM to better
simplify the code. This claws back half of the 10% regression,
though there is still a 5% gap.
xxxbxxx added a commit to xxxbxxx/zig that referenced this issue Aug 27, 2024
forcing llvm to truncate the padding bits prevents some optimizations.

fixes 370662c
see ziglang#17768
scheibo added a commit to pkmn/engine that referenced this issue Aug 29, 2024
Required as a workaround for ziglang/zig#17768 - currently the
performance is around 20% slower on master which is unacceptable.
This is obviously somewhat of a bummer (RIP 3a62fd3) and will be
a pretty big maintenance headache but its been almost a year without
(positive) progress on the Zig side so there's not really any better
options.

Clean `make integration` on the following builds:

  0.11.0
  0.12.0
  0.12.0-dev.789+e6590fea1
  0.12.0-dev.866+3a47bc715
  0.12.0-dev.876+aaf46187a
  0.12.0-dev.1396+f6de3ec96
  0.12.0-dev.1879+e19219fa0
  0.12.0-dev.2036+fc79b22a9
  0.12.0-dev.2665+919a3bae1
  0.12.0-dev.3644+05d975576
  0.12.1
  0.13.0
  0.13.0-dev.8+c352845e8
  0.13.0-dev.39+f6f7a47aa
  0.13.0-dev.46+3648d7df1
  0.14.0-dev.23+d9bd34fd0
  0.14.0-dev.564+75cf7fca9
  0.14.0-dev.994+9f46abf59
@scheibo
Copy link
Contributor Author

scheibo commented Aug 29, 2024

I just spent the day reverting my project to 0.11.0 - as noticed by @Inqnuam I'm now actually seeing 20%+ regression, not just 10% (and @xxxbxxx's suggestion above) no longer seems to move the needle much at all.

My project builds with the Zig compiler at HEAD and all the way back to 0.11.0. I don't know how long that will possible in the wake of breaking language changes (it seems like only breaking standard library and build system changes have occurred since 0.11.0), but currently I feel it would serve as a uniquely suitable testbed for someone interesting in attempting to improve compiler performance or the performance of compiled output. I would be very surprised if other projects haven't also experienced at least some sort of slowdown since 0.11.0, I just imagine such a slowdown is harder to attribute directly to Zig in the same way my project is able to due to the difficulties of supporting multiple Zig versions and how much project's code and feature set would usually change over time.

@scheibo scheibo changed the title 10% regression in ReleaseFast performance introduced in 0.12.0-dev.876+aaf46187a 20%+ regression in ReleaseFast performance since 0.11.0 Aug 29, 2024
baskuit pushed a commit to baskuit/engine that referenced this issue Sep 1, 2024
Required as a workaround for ziglang/zig#17768 - currently the
performance is around 20% slower on master which is unacceptable.
This is obviously somewhat of a bummer (RIP 3a62fd3) and will be
a pretty big maintenance headache but its been almost a year without
(positive) progress on the Zig side so there's not really any better
options.

Clean `make integration` on the following builds:

  0.11.0
  0.12.0
  0.12.0-dev.789+e6590fea1
  0.12.0-dev.866+3a47bc715
  0.12.0-dev.876+aaf46187a
  0.12.0-dev.1396+f6de3ec96
  0.12.0-dev.1879+e19219fa0
  0.12.0-dev.2036+fc79b22a9
  0.12.0-dev.2665+919a3bae1
  0.12.0-dev.3644+05d975576
  0.12.1
  0.13.0
  0.13.0-dev.8+c352845e8
  0.13.0-dev.39+f6f7a47aa
  0.13.0-dev.46+3648d7df1
  0.14.0-dev.23+d9bd34fd0
  0.14.0-dev.564+75cf7fca9
  0.14.0-dev.994+9f46abf59
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
backend-llvm The LLVM backend outputs an LLVM IR Module. optimization regression It worked in a previous version of Zig, but stopped working.
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants