-
Notifications
You must be signed in to change notification settings - Fork 386
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
make make _test.pkg.bft
consistent, especially on CI (tm2/test
)
#1320
Comments
It should prevent certain flaky tests from failing on GH. In any case, it increases the high-level timeout to be greater than the timeout for running 'go test'. Related to issue #1320.
It should prevent certain flaky tests from failing on GH. In any case, it increases the high-level timeout to be greater than the timeout for running 'go test'. Related to issue gnolang#1320.
It should prevent certain flaky tests from failing on GH. In any case, it increases the high-level timeout to be greater than the timeout for running 'go test'. Related to issue gnolang#1320.
I've extracted two stack traces from two different CI logs. stacktrace 1
stacktrace 2
From these traces, we can deduce that there is a deadlock that can appear in multiple tests from
The first conflicting lock appears when calling ConsensusState.GetRoundState(), which in turn calls The second lock appears in ConsensusState.handleMsg. After locking example on func TestSetValidBlockOnDelayedPrevote(t *testing.T) {
t.Parallel()
cs1, vss := randConsensusState(4)
vs2, vs3, vs4 := vss[1], vss[2], vss[3]
height, round := cs1.Height, cs1.Round
partSize := types.BlockPartSizeBytes
proposalCh := subscribe(cs1.evsw, cstypes.EventCompleteProposal{})
timeoutWaitCh := subscribe(cs1.evsw, cstypes.EventTimeoutWait{})
newRoundCh := subscribe(cs1.evsw, cstypes.EventNewRound{})
validBlockCh := subscribe(cs1.evsw, cstypes.EventNewValidBlock{})
addr := cs1.privValidator.GetPubKey().Address()
/// A vote subscriber is created here, which ultimately receive an event that is never consumed by this test.
voteCh := subscribeToVoter(cs1, addr)
// ... more code
ensureNewValidBlock(validBlockCh, height, round)
/// deadlock here
rs = cs1.GetRoundState()
assert.True(t, bytes.Equal(rs.ValidBlock.Hash(), propBlockHash))
assert.True(t, rs.ValidBlockParts.Header().Equals(propBlockParts.Header()))
assert.True(t, rs.ValidRound == round)
} |
ref #1320 This draft PR demonstrates a way for fixing flaky BFT tests. As mentioned in #1320 the main issue is that some channels are not fully consumed, leading to deadlocks. By using buffered channels as proxy in state tests, we can easily avoid these deadlocks. However, I don't think this is the best approach, as these channels were originally set to zero (capacity) for good reasons, but perhaps it could make sense for testing purposes. In any case, if anyone has a better idea for a solution, your suggestions are welcome. <details><summary>Contributors' checklist...</summary> - [ ] Added new tests, or not needed, or not feasible - [ ] Provided an example (e.g. screenshot) to aid review or the PR is self-explanatory - [ ] Updated the official documentation or not needed - [ ] No breaking changes were made, or a `BREAKING CHANGE: xxx` message was included in the description - [ ] Added references to related issues and PRs - [ ] Provided any useful hints for running manual tests - [ ] Added new benchmarks to [generated graphs](https://gnoland.github.io/benchmarks), if any. More info [here](https://github.com/gnolang/gno/blob/master/.benchmarks/README.md). </details> --------- Signed-off-by: gfanton <[email protected]> Co-authored-by: Manfred Touron <[email protected]>
The most explicit solution is to consume the event, synchronously. And with more events we may want to create helpers to consume events, but always synchronously/deterministically. So this should be reverted https://github.com/gnolang/gno/pull/1385/files as it masks opportunities to do the above. Eventually solutions like https://github.com/gnolang/gno/pull/1385/files will start to fail in unexpected ways (when the buffer becomes full at high workload).
If you can't figure out the root cause then I can work on it, but in any case we should revert 1385 first. |
@jaekwon, @gfanton has opened #1441 to revert the PR. I request you to keep it open instead of merging. The situation was causing a blockage for everyone, so I prefer that we handle this case gradually and effectively on the side. I promise that we will not launch mainnet without resolving the underlying problem. Let's discuss this further during our next call. @petar-dambovaliev, could you please add this low-level problem to your list of topics to review? I think it would be a great fit for you. |
This PR reverts #1385 following @jaekwon #1320 (comment) <!-- please provide a detailed description of the changes made in this pull request. --> <details><summary>Contributors' checklist...</summary> - [ ] Added new tests, or not needed, or not feasible - [ ] Provided an example (e.g. screenshot) to aid review or the PR is self-explanatory - [ ] Updated the official documentation or not needed - [ ] No breaking changes were made, or a `BREAKING CHANGE: xxx` message was included in the description - [ ] Added references to related issues and PRs - [ ] Provided any useful hints for running manual tests - [ ] Added new benchmarks to [generated graphs](https://gnoland.github.io/benchmarks), if any. More info [here](https://github.com/gnolang/gno/blob/master/.benchmarks/README.md). </details> Signed-off-by: gfanton <[email protected]>
As mentioned here, I have reverted the previous hack. Now, our priority should be to fix it instead of allowing it to grow over time. Let's seek help from @jaekwon and discuss it during one of the upcoming team meetings. @petar-dambovaliev, could you take a look in the next few days? I have a feeling that you can identify the issue quickly with your fast debugging skills. We need to determine the cause of the ratio increase in the failing tests. Did we introduce it by increasing parallelism or is it something else? Feel free to share tips on the fastest way to reproduce without running the entire lengthy test suite. |
I will also try to give another look, but let's make sure to have a review from @jaekwon this time |
@gfanton Starting from which commit/version did we have this issue on CI? If it did not appear in an earlier version, some change must have caused the testing case to fail. |
@piux2 I didn't attempt to do a bisect; that was on my to-do list. However, since I'm not able to reproduce the bug locally, I want to focus on that first. Based on what others are saying, it seems like this test has been failing for a long time. |
I reviewed the error message and logs, I doubt the testing behavior are rooted in pkg/bft test itself. https://github.com/gnolang/gno/actions/runs/6918163494/job/18820116982?pr=1309 The panic error message in the trace above is from following code. "test timed out after 20m0s" Line 198 in 968f89e
Basically it panics if the entire CI suite can not complete in 20mins. If we look at the start time and end time of pkg/bft CI log. It is about 20 mins. root causeThere is another deeper problem. How could a cmd/gno/test.go leak the error message to tm2/pkg/bft test log and take the input from tm2 test? It turns out the commands.IO passed in here wrapping a global os.Stdout and os.Stdin which could cause the issue because all go programs share the same os.Stdin and os.Stdout. Line 41 in 968f89e
Usually Golang os.Stdin and os.Stdout have separate *os.File file descriptor in its own process space. However, if all these instances are deployed on the same docker container, they could all reading from and writing to the same container’s stdin / stidout stream. the workflow running concurrently on docker image ghcr.io/gnolang/gno Line 20 in 968f89e
WDYT? |
This is definitely coming from bft
From this stack trace, we can see that the test has been stuck in
The multiple
The panic timeout isn't originating from So it's unlikely to be linked with an |
I am gonna take this over. |
According to @zivkovicmilos, the short-term solution is #1495. For the long-term solution, Milos wants to collaborate with @petar-dambovaliev on a larger refactoring. A potential temporary solution to provide more comfort for everyone is to label the tests as "flappy" so that they are checked intermittently instead of always (#202). However, it is advisable to avoid relying on this temporary system. |
diff --git a/tm2/pkg/bft/consensus/state_test.go b/tm2/pkg/bft/consensus/state_test.go
index 35877837..713be1df 100644
--- a/tm2/pkg/bft/consensus/state_test.go
+++ b/tm2/pkg/bft/consensus/state_test.go
@@ -1055,6 +1055,7 @@ func TestProposeValidBlock(t *testing.T) {
t.Log("### ONTO ROUND 4")
ensureNewProposal(proposalCh, height, round)
+ ensurePrevote(voteCh, height, round)
rs = cs1.GetRoundState()
assert.True(t, bytes.Equal(rs.ProposalBlock.Hash(), propBlockHash)) ^ that should fix it for the test TestProposeValidBlock. And likewise for similar flappy BFT tests. You will find that the deadlock happens toward the end of the test. This is because the race condition usually only manifests at the end where we forgot to read an event from a channel (and "ensure*" consumes these), and these are relatively easy to fix by looking at similar tests (or the same one above) and looking at the order of ensure calls there for each round. It's also possible that the deadlock happens within a round in the middle of a test. In that case the fix should probably be easy: probably just need to move one or more ensure* calls up before a call like cs1.GetRoundState(). The WARN statements: "[WARN] EventSwitch subscriber test-client blocked on " were put there to help fix these tests. So please use them and also keep the WARN logging because it will come handy in the future. The In the end, our tests will have precise event consumption testing, which is good. Any solution that tries to mask the problem by buffering the event system is just going to end up encountering some other more complex problem later. So please keep the BFT tests as is and keep adding ensure methods to completely cover all events and the order they are produced. |
I attempted multiple times to make these tests fail locally, but was unsuccessful, even with Does anyone know how we can write comprehensive tests that can fail locally and not only on the CI? One idea is to use the following pseudocode:
|
The problem stems from performance differences and spurts. That's not something you can generally detect for unless you use some tool to modify the timing of various goroutines randomly and allow for specifying behavior. I'm not suggesting that at all, that is crazy lol. For the BFT tests we can do this: --- a/tm2/pkg/bft/consensus/state_test.go
+++ b/tm2/pkg/bft/consensus/state_test.go
@@ -1055,6 +1055,9 @@ func TestProposeValidBlock(t *testing.T) {
t.Log("### ONTO ROUND 4")
ensureNewProposal(proposalCh, height, round)
+ //ensurePrevote(voteCh, height, round)
+. longEnough := 10 * time.Second
+ time.Sleep(longEnough)
rs = cs1.GetRoundState()
assert.True(t, bytes.Equal(rs.ProposalBlock.Hash(), propBlockHash)) Wait long enough before any last state reads or assertions on BFT tests for the consensus round. Locally this should get yours to fail too. 10 seconds sounds be enough on a local laptop for BFT state calculations to go through. On CI maybe it needs to be a bit longer. But we can optionally enable this by setting a flag that optionally searchces for these cases rather than it being the norm. Nobody wants BFT test cases to each take 10 seconds longer lol. |
ref gnolang#1320 This draft PR demonstrates a way for fixing flaky BFT tests. As mentioned in gnolang#1320 the main issue is that some channels are not fully consumed, leading to deadlocks. By using buffered channels as proxy in state tests, we can easily avoid these deadlocks. However, I don't think this is the best approach, as these channels were originally set to zero (capacity) for good reasons, but perhaps it could make sense for testing purposes. In any case, if anyone has a better idea for a solution, your suggestions are welcome. <details><summary>Contributors' checklist...</summary> - [ ] Added new tests, or not needed, or not feasible - [ ] Provided an example (e.g. screenshot) to aid review or the PR is self-explanatory - [ ] Updated the official documentation or not needed - [ ] No breaking changes were made, or a `BREAKING CHANGE: xxx` message was included in the description - [ ] Added references to related issues and PRs - [ ] Provided any useful hints for running manual tests - [ ] Added new benchmarks to [generated graphs](https://gnoland.github.io/benchmarks), if any. More info [here](https://github.com/gnolang/gno/blob/master/.benchmarks/README.md). </details> --------- Signed-off-by: gfanton <[email protected]> Co-authored-by: Manfred Touron <[email protected]>
This PR reverts gnolang#1385 following @jaekwon gnolang#1320 (comment) <!-- please provide a detailed description of the changes made in this pull request. --> <details><summary>Contributors' checklist...</summary> - [ ] Added new tests, or not needed, or not feasible - [ ] Provided an example (e.g. screenshot) to aid review or the PR is self-explanatory - [ ] Updated the official documentation or not needed - [ ] No breaking changes were made, or a `BREAKING CHANGE: xxx` message was included in the description - [ ] Added references to related issues and PRs - [ ] Provided any useful hints for running manual tests - [ ] Added new benchmarks to [generated graphs](https://gnoland.github.io/benchmarks), if any. More info [here](https://github.com/gnolang/gno/blob/master/.benchmarks/README.md). </details> Signed-off-by: gfanton <[email protected]>
This has been fixed as part of #1515 |
This is the test with the highest flappiness/flakiness.
Related with #202
The text was updated successfully, but these errors were encountered: