Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

make make _test.pkg.bft consistent, especially on CI (tm2/test) #1320

Closed
moul opened this issue Oct 30, 2023 · 15 comments
Closed

make make _test.pkg.bft consistent, especially on CI (tm2/test) #1320

moul opened this issue Oct 30, 2023 · 15 comments

Comments

@moul
Copy link
Member

moul commented Oct 30, 2023

This is the test with the highest flappiness/flakiness.

Related with #202

thehowl pushed a commit that referenced this issue Nov 7, 2023
It should prevent certain flaky tests from failing on GH.
In any case, it increases the high-level timeout to be greater than the
timeout for running 'go test'.

Related to issue #1320.
gfanton pushed a commit to gfanton/gno that referenced this issue Nov 9, 2023
It should prevent certain flaky tests from failing on GH.
In any case, it increases the high-level timeout to be greater than the
timeout for running 'go test'.

Related to issue gnolang#1320.
moul added a commit to moul/gno that referenced this issue Nov 14, 2023
It should prevent certain flaky tests from failing on GH.
In any case, it increases the high-level timeout to be greater than the
timeout for running 'go test'.

Related to issue gnolang#1320.
@gfanton
Copy link
Member

gfanton commented Nov 16, 2023

I've extracted two stack traces from two different CI logs.

stacktrace 1

2023-11-15T01:48:36.1098642Z panic: test timed out after 20m0s
2023-11-15T01:48:36.1099607Z running tests:
2023-11-15T01:48:36.1107254Z 	TestSetValidBlockOnDelayedPrevote (19m39s)
2023-11-15T01:48:36.1108122Z 	TestStateProposerSelection0 (19m39s)
2023-11-15T01:48:36.1108524Z 
2023-11-15T01:48:36.1108655Z goroutine 27498 [running]:
2023-11-15T01:48:36.1109331Z testing.(*M).startAlarm.func1()
2023-11-15T01:48:36.1109932Z 	/opt/hostedtoolcache/go/1.21.3/x64/src/testing/testing.go:2259 +0x3b9
2023-11-15T01:48:36.1110489Z created by time.goFunc
2023-11-15T01:48:36.1110948Z 	/opt/hostedtoolcache/go/1.21.3/x64/src/time/sleep.go:176 +0x2d
2023-11-15T01:48:36.1111344Z 
2023-11-15T01:48:36.1111493Z goroutine 1 [chan receive, 18 minutes]:
2023-11-15T01:48:36.1111895Z testing.tRunner.func1()
2023-11-15T01:48:36.1112411Z 	/opt/hostedtoolcache/go/1.21.3/x64/src/testing/testing.go:1561 +0x489
2023-11-15T01:48:36.1112989Z testing.tRunner(0xc0001851e0, 0xc00019bbb8)
2023-11-15T01:48:36.1113566Z 	/opt/hostedtoolcache/go/1.21.3/x64/src/testing/testing.go:1601 +0x138
2023-11-15T01:48:36.1114269Z testing.runTests(0xc0001e1040?, {0xe3e840, 0x38, 0x38}, {0xc00022a4c0?, 0xc00022a480?, 0xe44980?})
2023-11-15T01:48:36.1114990Z 	/opt/hostedtoolcache/go/1.21.3/x64/src/testing/testing.go:2052 +0x445
2023-11-15T01:48:36.1115467Z testing.(*M).Run(0xc0001e1040)
2023-11-15T01:48:36.1116047Z 	/opt/hostedtoolcache/go/1.21.3/x64/src/testing/testing.go:1925 +0x636
2023-11-15T01:48:36.1116761Z github.com/gnolang/gno/tm2/pkg/bft/consensus.TestMain(0x9d6e6d?)
2023-11-15T01:48:36.1117440Z 	/home/runner/work/gno/gno/tm2/pkg/bft/consensus/replay_test.go:44 +0xf4
2023-11-15T01:48:36.1117943Z main.main()
2023-11-15T01:48:36.1118196Z 	_testmain.go:193 +0x1e6
2023-11-15T01:48:36.1118378Z 
2023-11-15T01:48:36.1118502Z goroutine 18 [syscall, 19 minutes]:
2023-11-15T01:48:36.1118831Z os/signal.signal_recv()
2023-11-15T01:48:36.1119275Z 	/opt/hostedtoolcache/go/1.21.3/x64/src/runtime/sigqueue.go:152 +0x29
2023-11-15T01:48:36.1119733Z os/signal.loop()
2023-11-15T01:48:36.1120168Z 	/opt/hostedtoolcache/go/1.21.3/x64/src/os/signal/signal_unix.go:23 +0x13
2023-11-15T01:48:36.1120724Z created by os/signal.Notify.func1.1 in goroutine 8
2023-11-15T01:48:36.1121284Z 	/opt/hostedtoolcache/go/1.21.3/x64/src/os/signal/signal.go:151 +0x1f
2023-11-15T01:48:36.1121657Z 
2023-11-15T01:48:36.1121830Z goroutine 53 [sync.RWMutex.RLock, 19 minutes]:
2023-11-15T01:48:36.1122391Z sync.runtime_SemacquireRWMutexR(0xaaf460?, 0x60?, 0xa?)
2023-11-15T01:48:36.1123096Z 	/opt/hostedtoolcache/go/1.21.3/x64/src/runtime/sema.go:82 +0x25
2023-11-15T01:48:36.1123663Z sync.(*RWMutex).RLock(...)
2023-11-15T01:48:36.1124079Z 	/opt/hostedtoolcache/go/1.21.3/x64/src/sync/rwmutex.go:71
2023-11-15T01:48:36.1124784Z github.com/gnolang/gno/tm2/pkg/bft/consensus.(*ConsensusState).GetRoundState(0xc00017c000)
2023-11-15T01:48:36.1125976Z 	/home/runner/work/gno/gno/tm2/pkg/bft/consensus/state.go:228 +0x78
2023-11-15T01:48:36.1126740Z github.com/gnolang/gno/tm2/pkg/bft/consensus.TestStateProposerSelection0(0xc0002c5380)
2023-11-15T01:48:36.1127702Z 	/home/runner/work/gno/gno/tm2/pkg/bft/consensus/state_test.go:84 +0x23e
2023-11-15T01:48:36.1128237Z testing.tRunner(0xc0002c5380, 0xa19c80)
2023-11-15T01:48:36.1128744Z 	/opt/hostedtoolcache/go/1.21.3/x64/src/testing/testing.go:1595 +0xff
2023-11-15T01:48:36.1129251Z created by testing.(*T).Run in goroutine 1
2023-11-15T01:48:36.1129779Z 	/opt/hostedtoolcache/go/1.21.3/x64/src/testing/testing.go:1648 +0x3ad
2023-11-15T01:48:36.1130148Z 
2023-11-15T01:48:36.1130307Z goroutine 67 [sync.RWMutex.RLock, 19 minutes]:
2023-11-15T01:48:36.1130773Z sync.runtime_SemacquireRWMutexR(0x20?, 0x1?, 0xc00002ac40?)
2023-11-15T01:48:36.1131333Z 	/opt/hostedtoolcache/go/1.21.3/x64/src/runtime/sema.go:82 +0x25
2023-11-15T01:48:36.1131789Z sync.(*RWMutex).RLock(...)
2023-11-15T01:48:36.1132186Z 	/opt/hostedtoolcache/go/1.21.3/x64/src/sync/rwmutex.go:71
2023-11-15T01:48:36.1132864Z github.com/gnolang/gno/tm2/pkg/bft/consensus.(*ConsensusState).GetRoundState(0xc0000fa000)
2023-11-15T01:48:36.1133644Z 	/home/runner/work/gno/gno/tm2/pkg/bft/consensus/state.go:228 +0x78
2023-11-15T01:48:36.1134434Z github.com/gnolang/gno/tm2/pkg/bft/consensus.TestSetValidBlockOnDelayedPrevote(0xc0002e0000?)
2023-11-15T01:48:36.1135275Z 	/home/runner/work/gno/gno/tm2/pkg/bft/consensus/state_test.go:1093 +0x458
2023-11-15T01:48:36.1135818Z testing.tRunner(0xc0002e0340, 0xa19c00)
2023-11-15T01:48:36.1136444Z 	/opt/hostedtoolcache/go/1.21.3/x64/src/testing/testing.go:1595 +0xff
2023-11-15T01:48:36.1136948Z created by testing.(*T).Run in goroutine 1
2023-11-15T01:48:36.1137471Z 	/opt/hostedtoolcache/go/1.21.3/x64/src/testing/testing.go:1648 +0x3ad
2023-11-15T01:48:36.1137840Z 
2023-11-15T01:48:36.1137974Z goroutine 101 [chan receive, 19 minutes]:
2023-11-15T01:48:36.1138536Z github.com/gnolang/gno/tm2/pkg/autofile.(*AutoFile).setupCloseHandler.func1()
2023-11-15T01:48:36.1139249Z 	/home/runner/work/gno/gno/tm2/pkg/autofile/autofile_unix.go:17 +0x47
2023-11-15T01:48:36.1140031Z created by github.com/gnolang/gno/tm2/pkg/autofile.(*AutoFile).setupCloseHandler in goroutine 100
2023-11-15T01:48:36.1141070Z 	/home/runner/work/gno/gno/tm2/pkg/autofile/autofile_unix.go:16 +0xc5
2023-11-15T01:48:36.1141451Z 
2023-11-15T01:48:36.1141556Z goroutine 102 [select]:
2023-11-15T01:48:36.1142054Z github.com/gnolang/gno/tm2/pkg/autofile.(*AutoFile).closeFileRoutine(0xc000298d20)
2023-11-15T01:48:36.1142865Z 	/home/runner/work/gno/gno/tm2/pkg/autofile/autofile.go:93 +0x65
2023-11-15T01:48:36.1143607Z created by github.com/gnolang/gno/tm2/pkg/autofile.OpenAutoFile in goroutine 100
2023-11-15T01:48:36.1144297Z 	/home/runner/work/gno/gno/tm2/pkg/autofile/autofile.go:75 +0x196
2023-11-15T01:48:36.1144660Z 
2023-11-15T01:48:36.1144763Z goroutine 103 [select]:
2023-11-15T01:48:36.1145260Z github.com/gnolang/gno/tm2/pkg/bft/wal.(*baseWAL).processFlushTicks(0xc0001e80e0)
2023-11-15T01:48:36.1145921Z 	/home/runner/work/gno/gno/tm2/pkg/bft/wal/wal.go:152 +0x5d
2023-11-15T01:48:36.1146584Z created by github.com/gnolang/gno/tm2/pkg/bft/wal.(*baseWAL).OnStart in goroutine 100
2023-11-15T01:48:36.1147260Z 	/home/runner/work/gno/gno/tm2/pkg/bft/wal/wal.go:146 +0xbf
2023-11-15T01:48:36.1147588Z 
2023-11-15T01:48:36.1147727Z goroutine 85 [chan receive, 19 minutes]:
2023-11-15T01:48:36.1148438Z github.com/gnolang/gno/tm2/pkg/autofile.(*AutoFile).setupCloseHandler.func1()
2023-11-15T01:48:36.1149439Z 	/home/runner/work/gno/gno/tm2/pkg/autofile/autofile_unix.go:17 +0x47
2023-11-15T01:48:36.1150225Z created by github.com/gnolang/gno/tm2/pkg/autofile.(*AutoFile).setupCloseHandler in goroutine 84
2023-11-15T01:48:36.1151018Z 	/home/runner/work/gno/gno/tm2/pkg/autofile/autofile_unix.go:16 +0xc5
2023-11-15T01:48:36.1151481Z 
2023-11-15T01:48:36.1151592Z goroutine 86 [select]:
2023-11-15T01:48:36.1152087Z github.com/gnolang/gno/tm2/pkg/autofile.(*AutoFile).closeFileRoutine(0xc000330910)
2023-11-15T01:48:36.1152799Z 	/home/runner/work/gno/gno/tm2/pkg/autofile/autofile.go:93 +0x65
2023-11-15T01:48:36.1153476Z created by github.com/gnolang/gno/tm2/pkg/autofile.OpenAutoFile in goroutine 84
2023-11-15T01:48:36.1154596Z 	/home/runner/work/gno/gno/tm2/pkg/autofile/autofile.go:75 +0x196
2023-11-15T01:48:36.1154964Z 
2023-11-15T01:48:36.1155092Z goroutine 104 [select, 19 minutes]:
2023-11-15T01:48:36.1155769Z github.com/gnolang/gno/tm2/pkg/bft/consensus.(*timeoutTicker).timeoutRoutine(0xc000318660)
2023-11-15T01:48:36.1156646Z 	/home/runner/work/gno/gno/tm2/pkg/bft/consensus/ticker.go:96 +0x18a
2023-11-15T01:48:36.1157533Z created by github.com/gnolang/gno/tm2/pkg/bft/consensus.(*timeoutTicker).OnStart in goroutine 100
2023-11-15T01:48:36.1158456Z 	/home/runner/work/gno/gno/tm2/pkg/bft/consensus/ticker.go:51 +0x8e
2023-11-15T01:48:36.1158903Z 
2023-11-15T01:48:36.1159030Z goroutine 105 [select, 10 minutes]:
2023-11-15T01:48:36.1159819Z github.com/gnolang/gno/tm2/pkg/events.SubscribeFiltered.SubscribeFilteredOn.func1({0x95ce80, 0xc0001e0000})
2023-11-15T01:48:36.1160751Z 	/home/runner/work/gno/gno/tm2/pkg/events/subscribe.go:52 +0x279
2023-11-15T01:48:36.1162007Z github.com/gnolang/gno/tm2/pkg/events.(*eventSwitch).FireEvent(0xc0001e8070, {0x95ce80, 0xc0001e0000})
2023-11-15T01:48:36.1162962Z 	/home/runner/work/gno/gno/tm2/pkg/events/events.go:107 +0xe6
2023-11-15T01:48:36.1163896Z github.com/gnolang/gno/tm2/pkg/bft/consensus.(*ConsensusState).addVote(0xc0000fa000, 0xc0001e0000, {0x0, 0x0})
2023-11-15T01:48:36.1164981Z 	/home/runner/work/gno/gno/tm2/pkg/bft/consensus/state.go:1559 +0x6ec
2023-11-15T01:48:36.1166018Z github.com/gnolang/gno/tm2/pkg/bft/consensus.(*ConsensusState).tryAddVote(0xc0000fa000, 0xf?, {0x0?, 0x0?})
2023-11-15T01:48:36.1166838Z 	/home/runner/work/gno/gno/tm2/pkg/bft/consensus/state.go:1483 +0x5c
2023-11-15T01:48:36.1167735Z github.com/gnolang/gno/tm2/pkg/bft/consensus.(*ConsensusState).handleMsg(0xc0000fa000, {{0xaaaf60, 0xc0003161f0}, {0x0, 0x0}})
2023-11-15T01:48:36.1168747Z 	/home/runner/work/gno/gno/tm2/pkg/bft/consensus/state.go:691 +0x3a5
2023-11-15T01:48:36.1169551Z github.com/gnolang/gno/tm2/pkg/bft/consensus.(*ConsensusState).receiveRoutine(0xc0000fa000, 0x0)
2023-11-15T01:48:36.1170359Z 	/home/runner/work/gno/gno/tm2/pkg/bft/consensus/state.go:650 +0x4e6
2023-11-15T01:48:36.1171157Z created by github.com/gnolang/gno/tm2/pkg/bft/consensus.(*ConsensusState).OnStart in goroutine 100
2023-11-15T01:48:36.1172047Z 	/home/runner/work/gno/gno/tm2/pkg/bft/consensus/state.go:344 +0x565
2023-11-15T01:48:36.1172431Z 
2023-11-15T01:48:36.1172635Z goroutine 87 [select]:
2023-11-15T01:48:36.1173202Z github.com/gnolang/gno/tm2/pkg/bft/wal.(*baseWAL).processFlushTicks(0xc00012a1c0)
2023-11-15T01:48:36.1173876Z 	/home/runner/work/gno/gno/tm2/pkg/bft/wal/wal.go:152 +0x5d
2023-11-15T01:48:36.1174535Z created by github.com/gnolang/gno/tm2/pkg/bft/wal.(*baseWAL).OnStart in goroutine 84
2023-11-15T01:48:36.1175200Z 	/home/runner/work/gno/gno/tm2/pkg/bft/wal/wal.go:146 +0xbf
2023-11-15T01:48:36.1175515Z 
2023-11-15T01:48:36.1175634Z goroutine 88 [select, 19 minutes]:
2023-11-15T01:48:36.1176228Z github.com/gnolang/gno/tm2/pkg/bft/consensus.(*timeoutTicker).timeoutRoutine(0xc00009e420)
2023-11-15T01:48:36.1177000Z 	/home/runner/work/gno/gno/tm2/pkg/bft/consensus/ticker.go:96 +0x18a
2023-11-15T01:48:36.1177773Z created by github.com/gnolang/gno/tm2/pkg/bft/consensus.(*timeoutTicker).OnStart in goroutine 84
2023-11-15T01:48:36.1178558Z 	/home/runner/work/gno/gno/tm2/pkg/bft/consensus/ticker.go:51 +0x8e
2023-11-15T01:48:36.1179037Z 
2023-11-15T01:48:36.1179168Z goroutine 89 [select, 10 minutes]:
2023-11-15T01:48:36.1179877Z github.com/gnolang/gno/tm2/pkg/events.SubscribeToEventOn.SubscribeFilteredOn.func2({0x99d440, 0xc000330af0})
2023-11-15T01:48:36.1180898Z 	/home/runner/work/gno/gno/tm2/pkg/events/subscribe.go:52 +0x279
2023-11-15T01:48:36.1181656Z github.com/gnolang/gno/tm2/pkg/events.(*eventSwitch).FireEvent(0xc00012a150, {0x99d440, 0xc000330af0})
2023-11-15T01:48:36.1182395Z 	/home/runner/work/gno/gno/tm2/pkg/events/events.go:107 +0xe6
2023-11-15T01:48:36.1183229Z github.com/gnolang/gno/tm2/pkg/bft/consensus.(*ConsensusState).addProposalBlockPart(0xc00017c000, 0x20?, {0x0?, 0x0?})
2023-11-15T01:48:36.1184225Z 	/home/runner/work/gno/gno/tm2/pkg/bft/consensus/state.go:1446 +0x456
2023-11-15T01:48:36.1185133Z github.com/gnolang/gno/tm2/pkg/bft/consensus.(*ConsensusState).handleMsg(0xc00017c000, {{0xaaaf80, 0xc00031aab0}, {0x0, 0x0}})
2023-11-15T01:48:36.1186036Z 	/home/runner/work/gno/gno/tm2/pkg/bft/consensus/state.go:679 +0x16f
2023-11-15T01:48:36.1186819Z github.com/gnolang/gno/tm2/pkg/bft/consensus.(*ConsensusState).receiveRoutine(0xc00017c000, 0x0)
2023-11-15T01:48:36.1187607Z 	/home/runner/work/gno/gno/tm2/pkg/bft/consensus/state.go:650 +0x4e6
2023-11-15T01:48:36.1188381Z created by github.com/gnolang/gno/tm2/pkg/bft/consensus.(*ConsensusState).OnStart in goroutine 84
2023-11-15T01:48:36.1189171Z 	/home/runner/work/gno/gno/tm2/pkg/bft/consensus/state.go:344 +0x565
2023-11-15T01:48:36.1189797Z FAIL	github.com/gnolang/gno/tm2/pkg/bft/consensus	1200.017s
2023-11-15T01:48:36.1190394Z ?   	github.com/gnolang/gno/tm2/pkg/bft/consensus/config	[no test files]
2023-11-15T01:48:36.4972987Z === RUN   TestPeerCatchupRounds
2023-11-15T01:48:36.4973521Z === PAUSE TestPeerCatchupRounds
2023-11-15T01:48:36.4973866Z === CONT  TestPeerCatchupRounds
2023-11-15T01:48:36.4995471Z --- PASS: TestPeerCatchupRounds (0.00s)
2023-11-15T01:48:36.4995826Z PASS
2023-11-15T01:48:36.4999711Z coverage: 21.7% of statements
2023-11-15T01:48:36.5012639Z ok  	github.com/gnolang/gno/tm2/pkg/bft/consensus/types	0.009s	coverage: 21.7% of statements
2023-11-15T01:48:36.5031561Z ?   	github.com/gnolang/gno/tm2/pkg/bft/fail	[no test files]

stacktrace 2

2023-11-15T10:41:50.0083327Z panic: test timed out after 20m0s
2023-11-15T10:41:50.0084063Z running tests:
2023-11-15T10:41:50.0084729Z 	TestSetValidBlockOnDelayedPrevote (19m39s)
2023-11-15T10:41:50.0085339Z 
2023-11-15T10:41:50.0085566Z goroutine 29115 [running]:
2023-11-15T10:41:50.0086193Z testing.(*M).startAlarm.func1()
2023-11-15T10:41:50.0088611Z 	/opt/hostedtoolcache/go/1.21.3/x64/src/testing/testing.go:2259 +0x3b9
2023-11-15T10:41:50.0089187Z created by time.goFunc
2023-11-15T10:41:50.0089609Z 	/opt/hostedtoolcache/go/1.21.3/x64/src/time/sleep.go:176 +0x2d
2023-11-15T10:41:50.0089964Z 
2023-11-15T10:41:50.0090104Z goroutine 1 [chan receive, 19 minutes]:
2023-11-15T10:41:50.0090463Z testing.tRunner.func1()
2023-11-15T10:41:50.0090932Z 	/opt/hostedtoolcache/go/1.21.3/x64/src/testing/testing.go:1561 +0x489
2023-11-15T10:41:50.0091841Z testing.tRunner(0xc000167380, 0xc00019bbb8)
2023-11-15T10:41:50.0092375Z 	/opt/hostedtoolcache/go/1.21.3/x64/src/testing/testing.go:1601 +0x138
2023-11-15T10:41:50.0093090Z testing.runTests(0xc0001e1040?, {0xe3e840, 0x38, 0x38}, {0xc00022a4c0?, 0xc00022a480?, 0xe44980?})
2023-11-15T10:41:50.0093814Z 	/opt/hostedtoolcache/go/1.21.3/x64/src/testing/testing.go:2052 +0x445
2023-11-15T10:41:50.0094294Z testing.(*M).Run(0xc0001e1040)
2023-11-15T10:41:50.0094757Z 	/opt/hostedtoolcache/go/1.21.3/x64/src/testing/testing.go:1925 +0x636
2023-11-15T10:41:50.0095365Z github.com/gnolang/gno/tm2/pkg/bft/consensus.TestMain(0x9d6e6d?)
2023-11-15T10:41:50.0096032Z 	/home/runner/work/gno/gno/tm2/pkg/bft/consensus/replay_test.go:44 +0xf4
2023-11-15T10:41:50.0096532Z main.main()
2023-11-15T10:41:50.0096786Z 	_testmain.go:193 +0x1e6
2023-11-15T10:41:50.0096964Z 
2023-11-15T10:41:50.0097091Z goroutine 10 [syscall, 19 minutes]:
2023-11-15T10:41:50.0097420Z os/signal.signal_recv()
2023-11-15T10:41:50.0097863Z 	/opt/hostedtoolcache/go/1.21.3/x64/src/runtime/sigqueue.go:152 +0x29
2023-11-15T10:41:50.0098328Z os/signal.loop()
2023-11-15T10:41:50.0098762Z 	/opt/hostedtoolcache/go/1.21.3/x64/src/os/signal/signal_unix.go:23 +0x13
2023-11-15T10:41:50.0099311Z created by os/signal.Notify.func1.1 in goroutine 8
2023-11-15T10:41:50.0099863Z 	/opt/hostedtoolcache/go/1.21.3/x64/src/os/signal/signal.go:151 +0x1f
2023-11-15T10:41:50.0100234Z 
2023-11-15T10:41:50.0100390Z goroutine 57 [sync.RWMutex.RLock, 19 minutes]:
2023-11-15T10:41:50.0100858Z sync.runtime_SemacquireRWMutexR(0x20?, 0x1?, 0xc000304480?)
2023-11-15T10:41:50.0101416Z 	/opt/hostedtoolcache/go/1.21.3/x64/src/runtime/sema.go:82 +0x25
2023-11-15T10:41:50.0101858Z sync.(*RWMutex).RLock(...)
2023-11-15T10:41:50.0102254Z 	/opt/hostedtoolcache/go/1.21.3/x64/src/sync/rwmutex.go:71
2023-11-15T10:41:50.0102947Z github.com/gnolang/gno/tm2/pkg/bft/consensus.(*ConsensusState).GetRoundState(0xc000004300)
2023-11-15T10:41:50.0103755Z 	/home/runner/work/gno/gno/tm2/pkg/bft/consensus/state.go:228 +0x78
2023-11-15T10:41:50.0104663Z github.com/gnolang/gno/tm2/pkg/bft/consensus.TestSetValidBlockOnDelayedPrevote(0xc00048aea0?)
2023-11-15T10:41:50.0105521Z 	/home/runner/work/gno/gno/tm2/pkg/bft/consensus/state_test.go:1093 +0x458
2023-11-15T10:41:50.0106068Z testing.tRunner(0xc00048b1e0, 0xa19c00)
2023-11-15T10:41:50.0106570Z 	/opt/hostedtoolcache/go/1.21.3/x64/src/testing/testing.go:1595 +0xff
2023-11-15T10:41:50.0107073Z created by testing.(*T).Run in goroutine 1
2023-11-15T10:41:50.0107804Z 	/opt/hostedtoolcache/go/1.21.3/x64/src/testing/testing.go:1648 +0x3ad
2023-11-15T10:41:50.0108411Z 
2023-11-15T10:41:50.0108643Z goroutine 101 [chan receive, 19 minutes]:
2023-11-15T10:41:50.0109628Z github.com/gnolang/gno/tm2/pkg/autofile.(*AutoFile).setupCloseHandler.func1()
2023-11-15T10:41:50.0110868Z 	/home/runner/work/gno/gno/tm2/pkg/autofile/autofile_unix.go:17 +0x47
2023-11-15T10:41:50.0112264Z created by github.com/gnolang/gno/tm2/pkg/autofile.(*AutoFile).setupCloseHandler in goroutine 100
2023-11-15T10:41:50.0114061Z 	/home/runner/work/gno/gno/tm2/pkg/autofile/autofile_unix.go:16 +0xc5
2023-11-15T10:41:50.0114757Z 
2023-11-15T10:41:50.0115084Z goroutine 102 [select]:
2023-11-15T10:41:50.0115971Z github.com/gnolang/gno/tm2/pkg/autofile.(*AutoFile).closeFileRoutine(0xc0002a4320)
2023-11-15T10:41:50.0117203Z 	/home/runner/work/gno/gno/tm2/pkg/autofile/autofile.go:93 +0x65
2023-11-15T10:41:50.0118350Z created by github.com/gnolang/gno/tm2/pkg/autofile.OpenAutoFile in goroutine 100
2023-11-15T10:41:50.0119575Z 	/home/runner/work/gno/gno/tm2/pkg/autofile/autofile.go:75 +0x196
2023-11-15T10:41:50.0120200Z 
2023-11-15T10:41:50.0120383Z goroutine 103 [select]:
2023-11-15T10:41:50.0121228Z github.com/gnolang/gno/tm2/pkg/bft/wal.(*baseWAL).processFlushTicks(0xc0001e80e0)
2023-11-15T10:41:50.0122329Z 	/home/runner/work/gno/gno/tm2/pkg/bft/wal/wal.go:152 +0x5d
2023-11-15T10:41:50.0123477Z created by github.com/gnolang/gno/tm2/pkg/bft/wal.(*baseWAL).OnStart in goroutine 100
2023-11-15T10:41:50.0124784Z 	/home/runner/work/gno/gno/tm2/pkg/bft/wal/wal.go:146 +0xbf
2023-11-15T10:41:50.0125358Z 
2023-11-15T10:41:50.0125571Z goroutine 104 [select, 19 minutes]:
2023-11-15T10:41:50.0126610Z github.com/gnolang/gno/tm2/pkg/bft/consensus.(*timeoutTicker).timeoutRoutine(0xc0002fc900)
2023-11-15T10:41:50.0127965Z 	/home/runner/work/gno/gno/tm2/pkg/bft/consensus/ticker.go:96 +0x18a
2023-11-15T10:41:50.0129349Z created by github.com/gnolang/gno/tm2/pkg/bft/consensus.(*timeoutTicker).OnStart in goroutine 100
2023-11-15T10:41:50.0130734Z 	/home/runner/work/gno/gno/tm2/pkg/bft/consensus/ticker.go:51 +0x8e
2023-11-15T10:41:50.0131369Z 
2023-11-15T10:41:50.0131553Z goroutine 105 [select, 10 minutes]:
2023-11-15T10:41:50.0132707Z github.com/gnolang/gno/tm2/pkg/events.SubscribeFiltered.SubscribeFilteredOn.func1({0x95ce80, 0xc00018c000})
2023-11-15T10:41:50.0134178Z 	/home/runner/work/gno/gno/tm2/pkg/events/subscribe.go:52 +0x279
2023-11-15T10:41:50.0135504Z github.com/gnolang/gno/tm2/pkg/events.(*eventSwitch).FireEvent(0xc0001e8070, {0x95ce80, 0xc00018c000})
2023-11-15T10:41:50.0136829Z 	/home/runner/work/gno/gno/tm2/pkg/events/events.go:107 +0xe6
2023-11-15T10:41:50.0138226Z github.com/gnolang/gno/tm2/pkg/bft/consensus.(*ConsensusState).addVote(0xc000004300, 0xc00018c000, {0x0, 0x0})
2023-11-15T10:41:50.0139715Z 	/home/runner/work/gno/gno/tm2/pkg/bft/consensus/state.go:1559 +0x6ec
2023-11-15T10:41:50.0141215Z github.com/gnolang/gno/tm2/pkg/bft/consensus.(*ConsensusState).tryAddVote(0xc000004300, 0xf?, {0x0?, 0x0?})
2023-11-15T10:41:50.0142694Z 	/home/runner/work/gno/gno/tm2/pkg/bft/consensus/state.go:1483 +0x5c
2023-11-15T10:41:50.0144263Z github.com/gnolang/gno/tm2/pkg/bft/consensus.(*ConsensusState).handleMsg(0xc000004300, {{0xaaaf60, 0xc0000b21e0}, {0x0, 0x0}})
2023-11-15T10:41:50.0145901Z 	/home/runner/work/gno/gno/tm2/pkg/bft/consensus/state.go:691 +0x3a5
2023-11-15T10:41:50.0146959Z github.com/gnolang/gno/tm2/pkg/bft/consensus.(*ConsensusState).receiveRoutine(0xc000004300, 0x0)
2023-11-15T10:41:50.0147766Z 	/home/runner/work/gno/gno/tm2/pkg/bft/consensus/state.go:650 +0x4e6
2023-11-15T10:41:50.0148945Z created by github.com/gnolang/gno/tm2/pkg/bft/consensus.(*ConsensusState).OnStart in goroutine 100
2023-11-15T10:41:50.0149760Z 	/home/runner/work/gno/gno/tm2/pkg/bft/consensus/state.go:344 +0x565
2023-11-15T10:41:50.0150353Z FAIL	github.com/gnolang/gno/tm2/pkg/bft/consensus	1200.017s
2023-11-15T10:41:50.0150959Z ?   	github.com/gnolang/gno/tm2/pkg/bft/consensus/config	[no test files]

From these traces, we can deduce that there is a deadlock that can appear in multiple tests from state_test.go:

  • TestSetValidBlockOnDelayedPrevote
  • TestStateProposerSelection0
  • ....

The first conflicting lock appears when calling ConsensusState.GetRoundState(), which in turn calls ConsensusState.mtx.RLock().

The second lock appears in ConsensusState.handleMsg. After locking ConsensusState.mtx, the method attempts to add a vote, ultimately reaching the eventSwitch.FireEvent method. This method will block because there are no consumers for the event, and as a result, the mutex is never unlocked.

example on TestSetValidBlockOnDelayedPrevote:

func TestSetValidBlockOnDelayedPrevote(t *testing.T) {
	t.Parallel()

	cs1, vss := randConsensusState(4)
	vs2, vs3, vs4 := vss[1], vss[2], vss[3]
	height, round := cs1.Height, cs1.Round

	partSize := types.BlockPartSizeBytes

	proposalCh := subscribe(cs1.evsw, cstypes.EventCompleteProposal{})
	timeoutWaitCh := subscribe(cs1.evsw, cstypes.EventTimeoutWait{})
	newRoundCh := subscribe(cs1.evsw, cstypes.EventNewRound{})
	validBlockCh := subscribe(cs1.evsw, cstypes.EventNewValidBlock{})
	addr := cs1.privValidator.GetPubKey().Address()

	/// A vote subscriber is created here, which ultimately receive an event that is never consumed by this test.

	voteCh := subscribeToVoter(cs1, addr)

	// ... more code

	ensureNewValidBlock(validBlockCh, height, round)

	/// deadlock here
	rs = cs1.GetRoundState()

	assert.True(t, bytes.Equal(rs.ValidBlock.Hash(), propBlockHash))
	assert.True(t, rs.ValidBlockParts.Header().Equals(propBlockParts.Header()))
	assert.True(t, rs.ValidRound == round)
}

moul added a commit that referenced this issue Dec 7, 2023
ref #1320 

This draft PR demonstrates a way for fixing flaky BFT tests. 
As mentioned in #1320 the main issue is that some channels are not fully
consumed, leading to deadlocks. By using buffered channels as proxy in
state tests, we can easily avoid these deadlocks. However, I don't think
this is the best approach, as these channels were originally set to zero
(capacity) for good reasons, but perhaps it could make sense for testing
purposes.
In any case, if anyone has a better idea for a solution, your
suggestions are welcome.

<details><summary>Contributors' checklist...</summary>

- [ ] Added new tests, or not needed, or not feasible
- [ ] Provided an example (e.g. screenshot) to aid review or the PR is
self-explanatory
- [ ] Updated the official documentation or not needed
- [ ] No breaking changes were made, or a `BREAKING CHANGE: xxx` message
was included in the description
- [ ] Added references to related issues and PRs
- [ ] Provided any useful hints for running manual tests
- [ ] Added new benchmarks to [generated
graphs](https://gnoland.github.io/benchmarks), if any. More info
[here](https://github.com/gnolang/gno/blob/master/.benchmarks/README.md).
</details>

---------

Signed-off-by: gfanton <[email protected]>
Co-authored-by: Manfred Touron <[email protected]>
@jaekwon
Copy link
Contributor

jaekwon commented Dec 15, 2023

After locking ConsensusState.mtx, the method attempts to add a vote, ultimately reaching the eventSwitch.FireEvent method. This method will block because there are no consumers for the event, and as a result, the mutex is never unlocked.

The most explicit solution is to consume the event, synchronously. And with more events we may want to create helpers to consume events, but always synchronously/deterministically.

So this should be reverted https://github.com/gnolang/gno/pull/1385/files as it masks opportunities to do the above. Eventually solutions like https://github.com/gnolang/gno/pull/1385/files will start to fail in unexpected ways (when the buffer becomes full at high workload).

  • we don't want to use buffered event channels there; the 0 capacity helps uncover issues.
  • our tests should just be more explicit and consume events, or explicitly ignore them.
  • increasing buffer capacity is just sweeping the problem under the rug.

If you can't figure out the root cause then I can work on it, but in any case we should revert 1385 first.

@moul
Copy link
Member Author

moul commented Dec 15, 2023

@jaekwon, @gfanton has opened #1441 to revert the PR. I request you to keep it open instead of merging. The situation was causing a blockage for everyone, so I prefer that we handle this case gradually and effectively on the side.

I promise that we will not launch mainnet without resolving the underlying problem.

Let's discuss this further during our next call.

@petar-dambovaliev, could you please add this low-level problem to your list of topics to review? I think it would be a great fit for you.

@moul moul moved this to 🚀 Needed for Launch in 🚀 The Launch [DEPRECATED] Dec 15, 2023
moul pushed a commit that referenced this issue Dec 17, 2023
This PR reverts #1385 following @jaekwon
#1320 (comment)

<!-- please provide a detailed description of the changes made in this
pull request. -->

<details><summary>Contributors' checklist...</summary>

- [ ] Added new tests, or not needed, or not feasible
- [ ] Provided an example (e.g. screenshot) to aid review or the PR is
self-explanatory
- [ ] Updated the official documentation or not needed
- [ ] No breaking changes were made, or a `BREAKING CHANGE: xxx` message
was included in the description
- [ ] Added references to related issues and PRs
- [ ] Provided any useful hints for running manual tests
- [ ] Added new benchmarks to [generated
graphs](https://gnoland.github.io/benchmarks), if any. More info
[here](https://github.com/gnolang/gno/blob/master/.benchmarks/README.md).
</details>

Signed-off-by: gfanton <[email protected]>
@moul
Copy link
Member Author

moul commented Dec 17, 2023

As mentioned here, I have reverted the previous hack.

Now, our priority should be to fix it instead of allowing it to grow over time. Let's seek help from @jaekwon and discuss it during one of the upcoming team meetings.

@petar-dambovaliev, could you take a look in the next few days? I have a feeling that you can identify the issue quickly with your fast debugging skills.

We need to determine the cause of the ratio increase in the failing tests. Did we introduce it by increasing parallelism or is it something else?

Feel free to share tips on the fastest way to reproduce without running the entire lengthy test suite.

cc @zivkovicmilos @gfanton.

@moul moul added this to TM2/GnoVM Dec 17, 2023
@github-project-automation github-project-automation bot moved this to To triage in TM2/GnoVM Dec 17, 2023
@gfanton
Copy link
Member

gfanton commented Dec 17, 2023

I will also try to give another look, but let's make sure to have a review from @jaekwon this time

@piux2
Copy link
Contributor

piux2 commented Dec 20, 2023

@gfanton Starting from which commit/version did we have this issue on CI? If it did not appear in an earlier version, some change must have caused the testing case to fail.

@gfanton
Copy link
Member

gfanton commented Dec 20, 2023

@piux2 I didn't attempt to do a bisect; that was on my to-do list. However, since I'm not able to reproduce the bug locally, I want to focus on that first. Based on what others are saying, it seems like this test has been failing for a long time.

@piux2
Copy link
Contributor

piux2 commented Dec 21, 2023

@gfanton

I reviewed the error message and logs, I doubt the testing behavior are rooted in pkg/bft test itself.

https://github.com/gnolang/gno/actions/runs/6918163494/job/18820116982?pr=1309

The panic error message in the trace above is from following code.
go test -v -p 1 -timeout=20m -coverprofile=coverage.out -covermode=atomic ./pkg/bft/..

"test timed out after 20m0s"

panic("test timed out after " + cfg.timeout.String())

Basically it panics if the entire CI suite can not complete in 20mins.

If we look at the start time and end time of pkg/bft CI log. It is about 20 mins.

root cause

There is another deeper problem. How could a cmd/gno/test.go leak the error message to tm2/pkg/bft test log and take the input from tm2 test?

It turns out the commands.IO passed in here wrapping a global os.Stdout and os.Stdin which could cause the issue because all go programs share the same os.Stdin and os.Stdout.

func newTestCmd(io commands.IO) *commands.Command {

Usually Golang os.Stdin and os.Stdout have separate *os.File file descriptor in its own process space. However, if all these instances are deployed on the same docker container, they could all reading from and writing to the same container’s stdin / stidout stream.

the workflow running concurrently on docker image ghcr.io/gnolang/gno

group: ${{ github.workflow }}-${{ github.head_ref || github.run_id }}

WDYT?

@gfanton
Copy link
Member

gfanton commented Dec 21, 2023

@piux2

I reviewed the error message and logs, I doubt the testing behavior are rooted in pkg/bft test itself.

gnolang/gno/actions/runs/6918163494/job/18820116982?pr=1309

This is definitely coming from bft state_test.go, if you check the log from the PR you sent and extract the panic stack trace you get this:

2023-11-19T03:46:40.2211188Z 2023/11/19 03:46:40 [WARN] EventSwitch subscriber test-client blocked on {Vote{0:06EE40FAB2DD 1/04/1(Prevote) B9BDB6E2B856 7B92917CA6E6 @ 2023-11-19T03:45:30.216887417Z}} for 40s
2023-11-19T03:48:00.2219207Z 2023/11/19 03:48:00 [WARN] EventSwitch subscriber test-client blocked on {Vote{0:06EE40FAB2DD 1/04/1(Prevote) B9BDB6E2B856 7B92917CA6E6 @ 2023-11-19T03:45:30.216887417Z}} for 1m20s
2023-11-19T03:50:40.2222972Z 2023/11/19 03:50:40 [WARN] EventSwitch subscriber test-client blocked on {Vote{0:06EE40FAB2DD 1/04/1(Prevote) B9BDB6E2B856 7B92917CA6E6 @ 2023-11-19T03:45:30.216887417Z}} for 2m40s
2023-11-19T03:56:00.2225112Z 2023/11/19 03:56:00 [WARN] EventSwitch subscriber test-client blocked on {Vote{0:06EE40FAB2DD 1/04/1(Prevote) B9BDB6E2B856 7B92917CA6E6 @ 2023-11-19T03:45:30.216887417Z}} for 5m20s
2023-11-19T04:05:06.4032211Z 	github.com/gnolang/gno/tm2/pkg/bft/consensus	coverage: 73.4% of statements
2023-11-19T04:05:06.4086930Z panic: test timed out after 20m0s
2023-11-19T04:05:06.4087896Z running tests:
2023-11-19T04:05:06.4088502Z 	TestProposeValidBlock (19m37s)
2023-11-19T04:05:06.4088911Z 
2023-11-19T04:05:06.4089117Z goroutine 30501 [running]:
2023-11-19T04:05:06.4089671Z testing.(*M).startAlarm.func1()
2023-11-19T04:05:06.4090605Z 	/opt/hostedtoolcache/go/1.20.10/x64/src/testing/testing.go:2241 +0x3c5
2023-11-19T04:05:06.4091492Z created by time.goFunc
2023-11-19T04:05:06.4092178Z 	/opt/hostedtoolcache/go/1.20.10/x64/src/time/sleep.go:176 +0x32
2023-11-19T04:05:06.4092804Z 
2023-11-19T04:05:06.4093032Z goroutine 1 [chan receive, 19 minutes]:
2023-11-19T04:05:06.4093611Z testing.tRunner.func1()
2023-11-19T04:05:06.4094430Z 	/opt/hostedtoolcache/go/1.20.10/x64/src/testing/testing.go:1542 +0x4a5
2023-11-19T04:05:06.4095442Z testing.tRunner(0xc00016b1e0, 0xc000187bf8)
2023-11-19T04:05:06.4096457Z 	/opt/hostedtoolcache/go/1.20.10/x64/src/testing/testing.go:1582 +0x144
2023-11-19T04:05:06.4097915Z testing.runTests(0xc0001df4a0?, {0xe414e0, 0x38, 0x38}, {0xc0002285c0?, 0xc000228580?, 0xe48660?})
2023-11-19T04:05:06.4099383Z 	/opt/hostedtoolcache/go/1.20.10/x64/src/testing/testing.go:2034 +0x489
2023-11-19T04:05:06.4100420Z testing.(*M).Run(0xc0001df4a0)
2023-11-19T04:05:06.4101298Z 	/opt/hostedtoolcache/go/1.20.10/x64/src/testing/testing.go:1906 +0x63a
2023-11-19T04:05:06.4102478Z github.com/gnolang/gno/tm2/pkg/bft/consensus.TestMain(0x9f64f2?)
2023-11-19T04:05:06.4103668Z 	/home/runner/work/gno/gno/tm2/pkg/bft/consensus/replay_test.go:44 +0xe5
2023-11-19T04:05:06.4104481Z main.main()
2023-11-19T04:05:06.4104881Z 	_testmain.go:190 +0x1ef
2023-11-19T04:05:06.4105162Z 
2023-11-19T04:05:06.4105361Z goroutine 18 [syscall, 19 minutes]:
2023-11-19T04:05:06.4105888Z os/signal.signal_recv()
2023-11-19T04:05:06.4106603Z 	/opt/hostedtoolcache/go/1.20.10/x64/src/runtime/sigqueue.go:152 +0x2f
2023-11-19T04:05:06.4107745Z os/signal.loop()
2023-11-19T04:05:06.4108470Z 	/opt/hostedtoolcache/go/1.20.10/x64/src/os/signal/signal_unix.go:23 +0x19
2023-11-19T04:05:06.4109301Z created by os/signal.Notify.func1.1
2023-11-19T04:05:06.4110127Z 	/opt/hostedtoolcache/go/1.20.10/x64/src/os/signal/signal.go:151 +0x2a
2023-11-19T04:05:06.4110747Z 
2023-11-19T04:05:06.4110969Z goroutine 20836 [chan receive, 19 minutes]:
2023-11-19T04:05:06.4111930Z github.com/gnolang/gno/tm2/pkg/autofile.(*AutoFile).setupCloseHandler.func1()
2023-11-19T04:05:06.4113176Z 	/home/runner/work/gno/gno/tm2/pkg/autofile/autofile_unix.go:17 +0x4d
2023-11-19T04:05:06.4114317Z created by github.com/gnolang/gno/tm2/pkg/autofile.(*AutoFile).setupCloseHandler
2023-11-19T04:05:06.4115505Z 	/home/runner/work/gno/gno/tm2/pkg/autofile/autofile_unix.go:16 +0xca
2023-11-19T04:05:06.4116290Z 
2023-11-19T04:05:06.4116550Z goroutine 66 [sync.RWMutex.RLock, 19 minutes]:
2023-11-19T04:05:06.4117353Z sync.runtime_SemacquireRWMutexR(0x1?, 0x80?, 0x20?)
2023-11-19T04:05:06.4118294Z 	/opt/hostedtoolcache/go/1.20.10/x64/src/runtime/sema.go:82 +0x26
2023-11-19T04:05:06.4119077Z sync.(*RWMutex).RLock(...)
2023-11-19T04:05:06.4119766Z 	/opt/hostedtoolcache/go/1.20.10/x64/src/sync/rwmutex.go:71
2023-11-19T04:05:06.4120981Z github.com/gnolang/gno/tm2/pkg/bft/consensus.(*ConsensusState).GetRoundState(0xc000004f00)
2023-11-19T04:05:06.4122302Z 	/home/runner/work/gno/gno/tm2/pkg/bft/consensus/state.go:228 +0x7e
2023-11-19T04:05:06.4123666Z github.com/gnolang/gno/tm2/pkg/bft/consensus.TestProposeValidBlock(0xc0002de340)
2023-11-19T04:05:06.4125143Z 	/home/runner/work/gno/gno/tm2/pkg/bft/consensus/state_test.go:1059 +0xca5
2023-11-19T04:05:06.4126051Z testing.tRunner(0xc0002de340, 0xa3a478)
2023-11-19T04:05:06.4126974Z 	/opt/hostedtoolcache/go/1.20.10/x64/src/testing/testing.go:1576 +0x10b
2023-11-19T04:05:06.4127805Z created by testing.(*T).Run
2023-11-19T04:05:06.4128612Z 	/opt/hostedtoolcache/go/1.20.10/x64/src/testing/testing.go:1629 +0x3ea
2023-11-19T04:05:06.4129278Z 
2023-11-19T04:05:06.4129499Z goroutine 20840 [select, 11 minutes]:
2023-11-19T04:05:06.4130542Z github.com/gnolang/gno/tm2/pkg/events.SubscribeFilteredOn.func1({0x97f900, 0xc0050712c0})
2023-11-19T04:05:06.4131964Z 	/home/runner/work/gno/gno/tm2/pkg/events/subscribe.go:52 +0x28a
2023-11-19T04:05:06.4133329Z github.com/gnolang/gno/tm2/pkg/events.(*eventSwitch).FireEvent(0xc002348c40, {0x97f900, 0xc0050712c0})
2023-11-19T04:05:06.4134655Z 	/home/runner/work/gno/gno/tm2/pkg/events/events.go:107 +0xec
2023-11-19T04:05:06.4136059Z github.com/gnolang/gno/tm2/pkg/bft/consensus.(*ConsensusState).addVote(0xc000004f00, 0xc0050712c0, {0x0, 0x0})
2023-11-19T04:05:06.4137565Z 	/home/runner/work/gno/gno/tm2/pkg/bft/consensus/state.go:1559 +0x777
2023-11-19T04:05:06.4139029Z github.com/gnolang/gno/tm2/pkg/bft/consensus.(*ConsensusState).tryAddVote(0xc000004f00, 0xf?, {0x0?, 0x0?})
2023-11-19T04:05:06.4140505Z 	/home/runner/work/gno/gno/tm2/pkg/bft/consensus/state.go:1483 +0x65
2023-11-19T04:05:06.4142118Z github.com/gnolang/gno/tm2/pkg/bft/consensus.(*ConsensusState).handleMsg(0xc000004f00, {{0xac72e0, 0xc0090c1580}, {0x0, 0x0}})
2023-11-19T04:05:06.4143744Z 	/home/runner/work/gno/gno/tm2/pkg/bft/consensus/state.go:691 +0x3c5
2023-11-19T04:05:06.4145110Z github.com/gnolang/gno/tm2/pkg/bft/consensus.(*ConsensusState).receiveRoutine(0xc000004f00, 0x0)
2023-11-19T04:05:06.4146502Z 	/home/runner/work/gno/gno/tm2/pkg/bft/consensus/state.go:650 +0x4e6
2023-11-19T04:05:06.4147921Z created by github.com/gnolang/gno/tm2/pkg/bft/consensus.(*ConsensusState).OnStart
2023-11-19T04:05:06.4149209Z 	/home/runner/work/gno/gno/tm2/pkg/bft/consensus/state.go:344 +0x598
2023-11-19T04:05:06.4149886Z 
2023-11-19T04:05:06.4150070Z goroutine 20838 [select]:
2023-11-19T04:05:06.4150950Z github.com/gnolang/gno/tm2/pkg/bft/wal.(*baseWAL).processFlushTicks(0xc000a259d0)
2023-11-19T04:05:06.4152124Z 	/home/runner/work/gno/gno/tm2/pkg/bft/wal/wal.go:152 +0x65
2023-11-19T04:05:06.4153149Z created by github.com/gnolang/gno/tm2/pkg/bft/wal.(*baseWAL).OnStart
2023-11-19T04:05:06.4154366Z 	/home/runner/work/gno/gno/tm2/pkg/bft/wal/wal.go:146 +0xca
2023-11-19T04:05:06.4154964Z 
2023-11-19T04:05:06.4155153Z goroutine 20837 [select]:
2023-11-19T04:05:06.4156010Z github.com/gnolang/gno/tm2/pkg/autofile.(*AutoFile).closeFileRoutine(0xc005cce4b0)
2023-11-19T04:05:06.4157231Z 	/home/runner/work/gno/gno/tm2/pkg/autofile/autofile.go:93 +0x69
2023-11-19T04:05:06.4158261Z created by github.com/gnolang/gno/tm2/pkg/autofile.OpenAutoFile
2023-11-19T04:05:06.4159336Z 	/home/runner/work/gno/gno/tm2/pkg/autofile/autofile.go:75 +0x18a
2023-11-19T04:05:06.4159972Z 
2023-11-19T04:05:06.4160192Z goroutine 20839 [select, 19 minutes]:
2023-11-19T04:05:06.4160929Z github.com/gnolang/gno/tm2/pkg/bft/consensus.(*timeoutTicker).timeoutRoutine(0xc005b137a0)
2023-11-19T04:05:06.4162187Z 	/home/runner/work/gno/gno/tm2/pkg/bft/consensus/ticker.go:96 +0x196
2023-11-19T04:05:06.4163033Z created by github.com/gnolang/gno/tm2/pkg/bft/consensus.(*timeoutTicker).OnStart
2023-11-19T04:05:06.4163842Z 	/home/runner/work/gno/gno/tm2/pkg/bft/consensus/ticker.go:51 +0x96
2023-11-19T04:05:06.4164530Z FAIL	github.com/gnolang/gno/tm2/pkg/bft/consensus	1200.019s
2023-11-19T04:05:06.4165234Z ?   	github.com/gnolang/gno/tm2/pkg/bft/consensus/config	[no test files]

From this stack trace, we can see that the test has been stuck in TestProposeValidBlock for 19 minutes, more specifically at tm2/pkg/bft/consensus/state_test.go:1059:

panic: test timed out after 20m0s
running tests:
 	TestProposeValidBlock (19m37s)

The multiple [WARN] logs indicate that a vote event has been stuck and not consumed at tm2/pkg/events/subscribe.go:60:

[WARN] EventSwitch subscriber test-client blocked on {Vote{0:06EE40FAB2DD 1/04/1(Prevote) for 1m20s
[WARN] EventSwitch subscriber test-client blocked on {Vote{0:06EE40FAB2DD 1/04/1(Prevote) for 2m40s
[WARN] EventSwitch subscriber test-client blocked on {Vote{0:06EE40FAB2DD 1/04/1(Prevote) for 5m20s

The panic timeout isn't originating from gno but from the golang test suite, likely from github.com/golang/go/src/testing/testing.go:2366. Because the -timeout=20m parameter is specified in the go test command when running tests from the command you sent.

So it's unlikely to be linked with an stdout/stdin issue, or even related to gnovm.

@petar-dambovaliev
Copy link
Contributor

I am gonna take this over.

@moul
Copy link
Member Author

moul commented Jan 8, 2024

According to @zivkovicmilos, the short-term solution is #1495. For the long-term solution, Milos wants to collaborate with @petar-dambovaliev on a larger refactoring.

A potential temporary solution to provide more comfort for everyone is to label the tests as "flappy" so that they are checked intermittently instead of always (#202). However, it is advisable to avoid relying on this temporary system.

@moul moul moved this from To triage to In progress in TM2/GnoVM Jan 8, 2024
@jaekwon
Copy link
Contributor

jaekwon commented Jan 8, 2024

diff --git a/tm2/pkg/bft/consensus/state_test.go b/tm2/pkg/bft/consensus/state_test.go
index 35877837..713be1df 100644
--- a/tm2/pkg/bft/consensus/state_test.go
+++ b/tm2/pkg/bft/consensus/state_test.go
@@ -1055,6 +1055,7 @@ func TestProposeValidBlock(t *testing.T) {
        t.Log("### ONTO ROUND 4")
 
        ensureNewProposal(proposalCh, height, round)
+       ensurePrevote(voteCh, height, round)
 
        rs = cs1.GetRoundState()
        assert.True(t, bytes.Equal(rs.ProposalBlock.Hash(), propBlockHash))

^ that should fix it for the test TestProposeValidBlock. And likewise for similar flappy BFT tests.
There are probably no sync/lock bugs in Tendermint core bft; the reason why it is locked on cs1.GetRoundState() is because the consensus engine is locked trying to write an event (in this case a prevote event).

You will find that the deadlock happens toward the end of the test. This is because the race condition usually only manifests at the end where we forgot to read an event from a channel (and "ensure*" consumes these), and these are relatively easy to fix by looking at similar tests (or the same one above) and looking at the order of ensure calls there for each round.

It's also possible that the deadlock happens within a round in the middle of a test. In that case the fix should probably be easy: probably just need to move one or more ensure* calls up before a call like cs1.GetRoundState().

The WARN statements: "[WARN] EventSwitch subscriber test-client blocked on " were put there to help fix these tests. So please use them and also keep the WARN logging because it will come handy in the future.

The goroutine 20840 [select, 11 minutes]: for events.SubscribeFilteredOn is misleading because it's actually for 11m plus the time that produced the prior WARN messages: 40s + 1m20s + 2m40s + 5m20s = 600s = 10m. So this event send was actually stuck for 21 minutes it seems.

In the end, our tests will have precise event consumption testing, which is good. Any solution that tries to mask the problem by buffering the event system is just going to end up encountering some other more complex problem later. So please keep the BFT tests as is and keep adding ensure methods to completely cover all events and the order they are produced.

@moul
Copy link
Member Author

moul commented Jan 8, 2024

I attempted multiple times to make these tests fail locally, but was unsuccessful, even with -count 100.

Does anyone know how we can write comprehensive tests that can fail locally and not only on the CI? One idea is to use the following pseudocode:

for keyword1 in keywords {
    for keyword2 in keywords {
        setPerTestTimeout(...)
        node.init()
        node.add(keyword1)
        node.add(keyword2)
        node.processEvents()
        assertEmptyQueue()
    }
}

@jaekwon
Copy link
Contributor

jaekwon commented Jan 8, 2024

The problem stems from performance differences and spurts. That's not something you can generally detect for unless you use some tool to modify the timing of various goroutines randomly and allow for specifying behavior. I'm not suggesting that at all, that is crazy lol.

For the BFT tests we can do this:

--- a/tm2/pkg/bft/consensus/state_test.go
+++ b/tm2/pkg/bft/consensus/state_test.go
@@ -1055,6 +1055,9 @@ func TestProposeValidBlock(t *testing.T) {
        t.Log("### ONTO ROUND 4")
 
        ensureNewProposal(proposalCh, height, round)
+       //ensurePrevote(voteCh, height, round)
+.      longEnough := 10 * time.Second
+       time.Sleep(longEnough)
 
        rs = cs1.GetRoundState()
        assert.True(t, bytes.Equal(rs.ProposalBlock.Hash(), propBlockHash))

Wait long enough before any last state reads or assertions on BFT tests for the consensus round. Locally this should get yours to fail too. 10 seconds sounds be enough on a local laptop for BFT state calculations to go through. On CI maybe it needs to be a bit longer. But we can optionally enable this by setting a flag that optionally searchces for these cases rather than it being the norm. Nobody wants BFT test cases to each take 10 seconds longer lol.

gfanton added a commit to moul/gno that referenced this issue Jan 18, 2024
ref gnolang#1320 

This draft PR demonstrates a way for fixing flaky BFT tests. 
As mentioned in gnolang#1320 the main issue is that some channels are not fully
consumed, leading to deadlocks. By using buffered channels as proxy in
state tests, we can easily avoid these deadlocks. However, I don't think
this is the best approach, as these channels were originally set to zero
(capacity) for good reasons, but perhaps it could make sense for testing
purposes.
In any case, if anyone has a better idea for a solution, your
suggestions are welcome.

<details><summary>Contributors' checklist...</summary>

- [ ] Added new tests, or not needed, or not feasible
- [ ] Provided an example (e.g. screenshot) to aid review or the PR is
self-explanatory
- [ ] Updated the official documentation or not needed
- [ ] No breaking changes were made, or a `BREAKING CHANGE: xxx` message
was included in the description
- [ ] Added references to related issues and PRs
- [ ] Provided any useful hints for running manual tests
- [ ] Added new benchmarks to [generated
graphs](https://gnoland.github.io/benchmarks), if any. More info
[here](https://github.com/gnolang/gno/blob/master/.benchmarks/README.md).
</details>

---------

Signed-off-by: gfanton <[email protected]>
Co-authored-by: Manfred Touron <[email protected]>
gfanton added a commit to moul/gno that referenced this issue Jan 18, 2024
This PR reverts gnolang#1385 following @jaekwon
gnolang#1320 (comment)

<!-- please provide a detailed description of the changes made in this
pull request. -->

<details><summary>Contributors' checklist...</summary>

- [ ] Added new tests, or not needed, or not feasible
- [ ] Provided an example (e.g. screenshot) to aid review or the PR is
self-explanatory
- [ ] Updated the official documentation or not needed
- [ ] No breaking changes were made, or a `BREAKING CHANGE: xxx` message
was included in the description
- [ ] Added references to related issues and PRs
- [ ] Provided any useful hints for running manual tests
- [ ] Added new benchmarks to [generated
graphs](https://gnoland.github.io/benchmarks), if any. More info
[here](https://github.com/gnolang/gno/blob/master/.benchmarks/README.md).
</details>

Signed-off-by: gfanton <[email protected]>
@thehowl thehowl added this to the 🏗4️⃣ test4.gno.land milestone Apr 11, 2024
@Kouteki Kouteki moved this from Triage to Backlog in 🧙‍♂️gno.land core team Apr 19, 2024
@Kouteki Kouteki moved this from Backlog to In Progress in 🧙‍♂️gno.land core team May 15, 2024
@zivkovicmilos
Copy link
Member

This has been fixed as part of #1515

@github-project-automation github-project-automation bot moved this from In progress to Done in TM2/GnoVM May 26, 2024
@github-project-automation github-project-automation bot moved this from In Progress to Done in 🧙‍♂️gno.land core team May 26, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
Status: Done
Status: 🚀 Needed for Launch
Development

No branches or pull requests

7 participants