Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

cmd/compile: sporadic memory corruption on 386 (32-bit) builders #37881

Closed
bcmills opened this issue Mar 16, 2020 · 25 comments
Closed

cmd/compile: sporadic memory corruption on 386 (32-bit) builders #37881

bcmills opened this issue Mar 16, 2020 · 25 comments
Labels
FrozenDueToAge NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. release-blocker
Milestone

Comments

@bcmills
Copy link
Contributor

bcmills commented Mar 16, 2020

Various sweep increased allocation count errors have started cropping up in or near go/build invocations in various go commands.

2020-03-15T08:13:55-32dbccd/linux-386-clang
2020-03-14T07:03:15-d774d97/linux-386-sid
2020-03-16T20:59:27-ff1eb42/linux-386-387
2020-03-15T08:13:55-32dbccd/linux-386-clang
2020-03-14T07:03:15-d774d97/linux-386-sid
2020-03-13T20:43:12-e2a9ea0/openbsd-386-62

Since go/build is involved, this may be related to the go/types crash cluster (#37602, #37507, #37690).

CC @griesemer @matloob @aclements @danscales @mknyszek @cherrymui

@bcmills bcmills added the NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. label Mar 16, 2020
@bcmills bcmills added this to the Go1.15 milestone Mar 16, 2020
@bcmills
Copy link
Contributor Author

bcmills commented Mar 17, 2020

These found bad pointer in Go heap errors seem to fit the same root cause:

2020-03-16T22:31:39-2fbca94/linux-386-sid
2020-03-13T19:43:47-cbcb031/openbsd-386-62

@bcmills bcmills changed the title runtime: "sweep increased allocation count" during cmd/go package loading on 32-bit builders runtime: memory corruption in cmd/go on 32-bit builders Mar 17, 2020
@bcmills
Copy link
Contributor Author

bcmills commented Mar 17, 2020

This appears to be a dramatic uptick in memory corruption across the board on the 32-bit builders.

Given that it is only appearing on the 32-bit builders, and that we have not (to my knowledge) made any particularly racy or dangerous changes in cmd/go this cycle, marking as release-blocker for Go 1.15.

@bcmills
Copy link
Contributor Author

bcmills commented Mar 17, 2020

I also can't rule out a compiler bug as the cause, given the number of changes to the rewrite rules so far. (CC @josharian @randall77)

@bcmills
Copy link
Contributor Author

bcmills commented Mar 17, 2020

The first failure in this cluster was the same day that CL 222782 (#36468) was merged, so I suspect that may be related.

@bcmills bcmills changed the title runtime: memory corruption in cmd/go on 32-bit builders cmd/compile: sporadic memory corruption on 32-bit builders Mar 17, 2020
@bcmills
Copy link
Contributor Author

bcmills commented Mar 17, 2020

Note that some of the affected builders are also TryBots:
https://go-review.googlesource.com/c/go/+/223745/5#message-7c0b4f426d49ae03a1ec88007cf714c5134d4800

@ianlancetaylor
Copy link
Member

@jayconrod
Copy link
Contributor

@randall77
Copy link
Contributor

I'm kinda stumped on this one. I've tried to reproduce to no avail, and pouring over CL 222782 doesn't reveal anything that might cause intermittent issues like this.

On a possibly related note, I just mailed a change to add more addressing mode modifications for amd64. When that goes in, whether or not we start seeing this on amd64 will be illuminating.

@bcmills bcmills changed the title cmd/compile: sporadic memory corruption on 32-bit builders cmd/compile: sporadic memory corruption on 386 (32-bit) builders Mar 20, 2020
@ianlancetaylor
Copy link
Member

@griesemer
Copy link
Contributor

I have not followed this much and I am shooting from the hip here, but perhaps rolling back the rewrite rule changes just to rule them out might be helpful (and then role them back in, selectively). That would be cheap in terms of man-power.

@gopherbot
Copy link
Contributor

Change https://golang.org/cl/224457 mentions this issue: cmd/compile: indexed loads/stores can't be faultOnNilArg0

@mknyszek
Copy link
Contributor

I think this is another example? This time in a trybot. https://storage.googleapis.com/go-build-log/899dcded/linux-386_238c26dc.log

gopherbot pushed a commit that referenced this issue Mar 21, 2020
Because of the index, these ops can't guarantee faulting if arg0 is nil.

Clean up the PPC64 index ops - they can't take a sym or an offset.

Noticed while debugging #37881. I don't think it is the cause, but I guess
there is a chance.
Update #37881

Change-Id: Ic22925250bf7b1ba64e3cea1a65638bc4bab390c
Reviewed-on: https://go-review.googlesource.com/c/go/+/224457
Run-TryBot: Keith Randall <[email protected]>
TryBot-Result: Gobot Gobot <[email protected]>
Reviewed-by: Cherry Zhang <[email protected]>
@rasky
Copy link
Member

rasky commented Mar 22, 2020

New failure today, so I think CL224457 has not fixed it:
https://storage.googleapis.com/go-build-log/1437e11f/windows-386-2008_f0315367.log

@gopherbot
Copy link
Contributor

Change https://golang.org/cl/224837 mentions this issue: cmd/compile: disable mem+op operations on 386

gopherbot pushed a commit that referenced this issue Mar 23, 2020
Rolling back portions of CL 222782 to see if that helps
issue #37881 any.

Update #37881

Change-Id: I9cc3ff8c469fa5e4b22daec715d04148033f46f7
Reviewed-on: https://go-review.googlesource.com/c/go/+/224837
Run-TryBot: Keith Randall <[email protected]>
TryBot-Result: Gobot Gobot <[email protected]>
Reviewed-by: Bryan C. Mills <[email protected]>
@randall77
Copy link
Contributor

Looks like that last CL didn't help - there was still a failure:
https://build.golang.org/log/91e3e48c8707daaafebec715e68ffea4d1e2adec
(freebsd 12_0 386)
Time to remove some more of that CL.

@gopherbot
Copy link
Contributor

Change https://golang.org/cl/225057 mentions this issue: cmd/compile: disable addressingmodes pass for 386

gopherbot pushed a commit that referenced this issue Mar 23, 2020
Update #37881

Change-Id: I1f9a3f57f6215a19c31765c257ee78715eab36b7
Reviewed-on: https://go-review.googlesource.com/c/go/+/225057
Run-TryBot: Keith Randall <[email protected]>
Reviewed-by: Bryan C. Mills <[email protected]>
TryBot-Result: Gobot Gobot <[email protected]>
@gopherbot
Copy link
Contributor

Change https://golang.org/cl/225197 mentions this issue: Revert "cmd/compile: disable addressingmodes pass for 386"

@gopherbot
Copy link
Contributor

Change https://golang.org/cl/225217 mentions this issue: Revert "cmd/compile: disable mem+op operations on 386"

@gopherbot
Copy link
Contributor

Change https://golang.org/cl/225218 mentions this issue: Revert "cmd/compile: convert 386 port to use addressing modes pass"

gopherbot pushed a commit that referenced this issue Mar 24, 2020
This reverts commit CL 225057.

Reason for revert: Undoing partial reverts of CL 222782

Update #37881

Change-Id: Iee024cab2a580a37a0fc355e0e3c5ad3d8fdaf7d
Reviewed-on: https://go-review.googlesource.com/c/go/+/225197
Reviewed-by: Bryan C. Mills <[email protected]>
gopherbot pushed a commit that referenced this issue Mar 24, 2020
This reverts commit CL 224837.

Reason for revert: Reverting partial reverts of 222782.

Update #37881

Change-Id: Ie9bf84d6e17ed214abe538965e5ff03936886826
Reviewed-on: https://go-review.googlesource.com/c/go/+/225217
Reviewed-by: Bryan C. Mills <[email protected]>
gopherbot pushed a commit that referenced this issue Mar 24, 2020
This reverts commit CL 222782.

Reason for revert: Reverting to see if 386 errors go away

Update #37881

Change-Id: I74f287404c52414db1b6ff1649effa4ed9e5cc0c
Reviewed-on: https://go-review.googlesource.com/c/go/+/225218
Reviewed-by: Bryan C. Mills <[email protected]>
@gopherbot
Copy link
Contributor

Change https://golang.org/cl/225798 mentions this issue: cmd/compile: convert 386 port to use addressing modes pass (take 2)

@randall77
Copy link
Contributor

I might have figured out what is wrong. Read the description of the CL above for all the gory details.
I'm not 100% sure, but enough so to try and submit this again and see what happens (I still can't reproduce outside trybots).

gopherbot pushed a commit that referenced this issue Mar 27, 2020
Retrying CL 222782, with a fix that will hopefully stop the random crashing.

The issue with the previous CL is that it does pointer arithmetic
in a way that may briefly generate an out-of-bounds pointer. If an
interrupt happens to occur in that state, the referenced object may
be collected incorrectly.

Suppose there was code that did s[x+c].  The previous CL had a rule
to the effect of ptr + (x + c) -> c + (ptr + x).  But ptr+x is not
guaranteed to point to the same object as ptr. In contrast,
ptr+(x+c) is guaranteed to point to the same object as ptr, because
we would have already checked that x+c is in bounds.

For example, strconv.trim used to have this code:
  MOVZX -0x1(BX)(DX*1), BP
  CMPL $0x30, AL
After CL 222782, it had this code:
  LEAL 0(BX)(DX*1), BP
  CMPB $0x30, -0x1(BP)

An interrupt between those last two instructions could see BP pointing
outside the backing store of the slice involved.

It's really hard to actually demonstrate a bug. First, you need to
have an interrupt occur at exactly the right time. Then, there must
be no other pointers to the object in question. Since the interrupted
frame will be scanned conservatively, there can't even be a dead
pointer in another register or on the stack. (In the example above,
a bug can't happen because BX still holds the original pointer.)
Then, the object in question needs to be collected (or at least
scanned?) before the interrupted code continues.

This CL needs to handle load combining somewhat differently than CL 222782
because of the new restriction on arithmetic. That's the only real
difference (other than removing the bad rules) from that old CL.

This bug is also present in the amd64 rewrite rules, and we haven't
seen any crashing as a result. I will fix up that code similarly to
this one in a separate CL.

Update #37881

Change-Id: I5f0d584d9bef4696bfe89a61ef0a27c8d507329f
Reviewed-on: https://go-review.googlesource.com/c/go/+/225798
Run-TryBot: Keith Randall <[email protected]>
TryBot-Result: Gobot Gobot <[email protected]>
Reviewed-by: Cherry Zhang <[email protected]>
@bcmills
Copy link
Contributor Author

bcmills commented Mar 27, 2020

Unfortunately, https://build.golang.org/log/b634dbda2b877a9859e65cd5eba87577e9e33dee looks like it may be another instance of this failure after the most recent attempt.

freebsd-386-12_0 at 9ceb1e5f5caca5666f9db50864c45ca1f88da1df
…

##### ../misc/reboot
Building Go cmd/dist using /tmp/workdir/go. (devel 9ceb1e5f5caca5666f9db50864c45ca1f88da1df freebsd/386)
Building Go toolchain1 using /tmp/workdir/go.
Building Go bootstrap cmd/go (go_bootstrap) using Go toolchain1.
Building Go toolchain2 using go_bootstrap and Go toolchain1.
Building Go toolchain3 using go_bootstrap and Go toolchain2.
Building packages and commands for freebsd/386.
# crypto/tls
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x40 pc=0x862b4c0]

goroutine 38 [running]:
cmd/compile/internal/gc.buildssa(0x0, 0x2, 0x0)
	/tmp/workdir/tmp/reboot-goroot082852438/src/cmd/compile/internal/gc/ssa.go:297 +0xb0
cmd/compile/internal/gc.compileSSA(0x0, 0x2)
	/tmp/workdir/tmp/reboot-goroot082852438/src/cmd/compile/internal/gc/pgen.go:296 +0x4c
cmd/compile/internal/gc.compileFunctions.func2(0x3a59ccc0, 0x39016c70, 0x2)
	/tmp/workdir/tmp/reboot-goroot082852438/src/cmd/compile/internal/gc/pgen.go:361 +0x35
created by cmd/compile/internal/gc.compileFunctions
	/tmp/workdir/tmp/reboot-goroot082852438/src/cmd/compile/internal/gc/pgen.go:359 +0xf7
go tool dist: FAILED: /tmp/workdir/tmp/reboot-goroot082852438/pkg/tool/freebsd_386/go_bootstrap install -gcflags=all= -ldflags=all= std cmd: exit status 2
--- FAIL: TestRepeatBootstrap (98.43s)
    reboot_test.go:50: exit status 2

@cherrymui
Copy link
Member

This looks somewhat different. The original failure looks like the GC found (temporary) bad pointers. The new one is a segfault.

@gopherbot
Copy link
Contributor

Change https://golang.org/cl/226437 mentions this issue: cmd/compile: fix ephemeral pointer problem on amd64

gopherbot pushed a commit that referenced this issue Mar 30, 2020
Make sure we don't use the rewrite ptr + (c + x) -> c + (ptr + x), as
that may create an ephemeral out-of-bounds pointer.

I have not seen an actual bug caused by this yet, but we've seen
them in the 386 port so I'm fixing this issue for amd64 as well.

The load-combining rules needed to be reworked somewhat to still
work without the above broken rule.

Update #37881

Change-Id: I8046d170e89e2035195f261535e34ca7d8aca68a
Reviewed-on: https://go-review.googlesource.com/c/go/+/226437
Run-TryBot: Keith Randall <[email protected]>
TryBot-Result: Gobot Gobot <[email protected]>
Reviewed-by: Cherry Zhang <[email protected]>
@randall77
Copy link
Contributor

I'm going to declare this fixed. The original CL with fixed rules is in, both for 386 and amd64.
I'm not seeing any similar failures on the dashboard since the fixes went in.

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
FrozenDueToAge NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. release-blocker
Projects
None yet
Development

No branches or pull requests

9 participants