bank hash mismatch causing dropped votes #7736
If this reproduces again, the additional logging at #7733 should help.
Thanks for narrowing this down! Status report: I've carefully read the suspicious, relevant code with all the knowledge I have, but nothing rings a bell so far... I'll try to repro it locally.
Phew, I've found something interesting. Stay tuned!
As always, I don't know whether this is really the culprit or just the immediate cause of it... It seems that the current runtime non-deterministically managed to execute identical (or very similar) vote transactions on some validators and not on others, which could cause differing bank hashes. First, let's look at an incident where the cluster survived (although I'll quote the actual validators' node keys for clarity and further investigation, no blame on them; given the probabilistic nature of this, it can happen to anyone):
Next, the lethal incident:
In this case, after duplicate execution of similar vote transactions, various validators started to vote with differing hashes, ultimately leading to the death of the cluster.
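To make that failure mode concrete, here is a minimal, self-contained sketch (toy code, not Solana's runtime; all names and numbers are made up) of why executing the same vote transaction a different number of times on different validators must diverge the bank hash: even a one-lamport difference in a single account changes the hash each validator then votes with.

```rust
use std::collections::BTreeMap;
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Toy stand-in for a bank: account -> lamports. BTreeMap keeps iteration order stable.
type ToyBank = BTreeMap<&'static str, u64>;

/// Hash the entire (ordered) account state, standing in for the real bank hash.
fn toy_bank_hash(bank: &ToyBank) -> u64 {
    let mut hasher = DefaultHasher::new();
    for (account, lamports) in bank {
        account.hash(&mut hasher);
        lamports.hash(&mut hasher);
    }
    hasher.finish()
}

/// "Executing a vote transaction" here just debits the transaction fee from the voter.
fn apply_vote_tx(bank: &mut ToyBank, voter: &'static str, fee: u64) {
    *bank.entry(voter).or_insert(0) -= fee;
}

fn main() {
    let initial: ToyBank = [("validator_a", 1_000), ("validator_b", 1_000)]
        .into_iter()
        .collect();

    // Validator 1 executes the vote transaction exactly once.
    let mut bank1 = initial.clone();
    apply_vote_tx(&mut bank1, "validator_a", 1);

    // Validator 2 somehow executes the identical (or near-identical) vote twice.
    let mut bank2 = initial.clone();
    apply_vote_tx(&mut bank2, "validator_a", 1);
    apply_vote_tx(&mut bank2, "validator_a", 1);

    // A single-lamport difference in any account diverges the bank hash,
    // so the two validators now vote with different hashes.
    assert_ne!(toy_bank_hash(&bank1), toy_bank_hash(&bank2));
}
```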
Another point of note is that when the ledger from those errant nodes is replayed using solana-ledger-tool verify, the correct bank hash is produced.
I'm running out of gas for this week... @mvines Yeah, that should be kept in mind! As far as I've glanced at the codebase, no quick success hunting the root cause. I'm suspecting something around StatusCache and FeeCalculator/Collector, because these failed vote TXes are...
Another approach to the investigation I have in mind: since we have plenty of logs, we could use graphviz to graph, with relatively little effort, how the vote hash discrepancy spread across the cluster over the incident period. Combined with the corresponding leader schedule, this might shed some light...
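A rough sketch of that idea, assuming the log lines have already been parsed into (slot, validator, bank hash) tuples (the records below are invented for illustration); it just emits graphviz DOT grouping validators by the hash they voted for at each slot:

```rust
use std::collections::BTreeMap;

fn main() {
    // Hypothetical, already-parsed records: (slot, validator identity, bank hash voted with).
    let records: Vec<(u64, &str, &str)> = vec![
        (320803, "validator_F7F", "hash_A"),
        (320803, "validator_4Bx", "hash_B"),
        (320804, "validator_F7F", "hash_A"),
    ];

    // Group validators by the (slot, hash) they voted for, so divergent groups per slot
    // stand out as separate nodes in the rendered graph.
    let mut by_slot_hash: BTreeMap<(u64, &str), Vec<&str>> = BTreeMap::new();
    for &(slot, validator, hash) in &records {
        by_slot_hash.entry((slot, hash)).or_default().push(validator);
    }

    // Emit graphviz DOT on stdout; pipe it into `dot -Tsvg` to render.
    println!("digraph vote_hashes {{");
    for ((slot, hash), validators) in &by_slot_hash {
        for validator in validators {
            println!("  \"{}\" -> \"slot{}_{}\";", validator, slot, hash);
        }
    }
    println!("}}");
}
```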
Oh, one last thing: just see what the situation was in the ledger when those errors happened at those times.
I took the F7F validator log case: the first one seems to be on fork 320738.
The second one seems to be on fork 320803.
I think it's likely just the same vote applied on two different forks.
@sakridge Thanks for checking it out! So it seems that these log patterns are harmless. Another suspicious error is:
This occurs a lot, but only on 4Bx5bzjmPrU1g74AHfYpTMXvspBt8GnvZVQW3ba9z4Af, and there seems to be no corresponding error on the other validator's side. Can the requesting validator really handle this error correctly?
More logs from validators around the deadly event:
It just means that when he sent the repair response, some kind of error was encountered while trying to send a packet. I'm not really sure exactly what can cause that; maybe a full UDP buffer or some kind of network driver/kernel issue. But the requesting node would ask other nodes to also send it at some point, so it shouldn't be fatal. Maybe it would matter if every node on the network somehow got into a state where it could not send anything, but that doesn't seem to be the case. I would like to understand better what can cause this.
Admittedly, this was hard to repro. However, it finally seems that I managed to repro this locally... (stay tuned)
Your guess for the cause is correct; I elaborated on this a bit here: #7840 :)
Here are the long-awaited test steps to reproduce this scary bank hash mismatch bug. :) The basic idea is to stress-test the banking code as much as possible while excluding other subsystems. To that end, I was forced to disable signature generation and verification entirely and to increase the number of banking threads to a somewhat insane value. As this bug was discovered this way, I think it's generally worthwhile to maintain these development flags. I know beefy machines with CUDA might suffice, but debugging without the... Steps To Reproduce:
Once I managed to reproduce this bug locally, I could narrow it down step by step, ultimately resulting in this PR (#7797). Without these PRs, it's really hard to pin down the suspicious subsystem unless we already know it's the... @mvines @danpaul000 Once merged, I think we should also add a test scenario like this STR as one of our daily bench/sanity tests, so that we can notice race-condition regressions early, since they tend to be very hard to debug. Any input on this? FYI: @carllin (as I was asked for the STR; here is the meat!)
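As a rough illustration of what such a bench/sanity test would check (toy code, not the actual Solana harness; the structure and numbers are assumptions): replay the same transaction batch across many threads, many times, and assert that the resulting state hash never changes. Any run-to-run difference points at a race in the parallel execution path.

```rust
use std::collections::BTreeMap;
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};
use std::sync::{Arc, Mutex};
use std::thread;

/// Replay one batch of (account, delta) "transactions" across many worker threads,
/// then hash the resulting account state. With correct synchronization this is
/// deterministic; a race shows up as a differing hash on some run.
fn run_once(batch: &[(&'static str, i64)], num_threads: usize) -> u64 {
    let bank = Arc::new(Mutex::new(BTreeMap::<&'static str, i64>::new()));
    let chunks: Vec<Vec<(&'static str, i64)>> = batch
        .chunks(batch.len() / num_threads + 1)
        .map(|c| c.to_vec())
        .collect();

    let handles: Vec<_> = chunks
        .into_iter()
        .map(|chunk| {
            let bank = Arc::clone(&bank);
            thread::spawn(move || {
                for (account, delta) in chunk {
                    *bank.lock().unwrap().entry(account).or_insert(0) += delta;
                }
            })
        })
        .collect();
    for handle in handles {
        handle.join().unwrap();
    }

    // Hash the final (ordered) state, standing in for the bank hash.
    let mut hasher = DefaultHasher::new();
    for entry in bank.lock().unwrap().iter() {
        entry.hash(&mut hasher);
    }
    hasher.finish()
}

fn main() {
    // A batch whose net effect is order-independent; every run must produce the same hash.
    let batch: Vec<(&'static str, i64)> = (0..1_000)
        .map(|i| if i % 2 == 0 { ("alice", 1) } else { ("bob", -1) })
        .collect();

    let reference = run_once(&batch, 64);
    for _ in 0..100 {
        assert_eq!(run_once(&batch, 64), reference);
    }
    println!("deterministic across 100 stress runs");
}
```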
DR6 failed because more than 33% of the validators somehow generated an inconsistent bank hash which caused the remainder of the cluster to reject their votes:
solana/programs/vote/src/vote_state.rs
Lines 275 to 279 in 719785a
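For readers without the source at hand, the check referenced there is roughly of this shape (a hedged sketch, not the actual vote program code; ToyVote and check_vote are invented names): a vote carries the bank hash the voter computed, and the vote is rejected when that hash does not match the hash this validator computed for the same slot, which is exactly how the votes from the divergent group get dropped.

```rust
#[derive(Debug, PartialEq)]
enum VoteError {
    SlotHashMismatch,
    SlotNotFound,
}

/// A vote names the slots being voted on plus the bank hash the voter saw for the last one.
struct ToyVote {
    slots: Vec<u64>,
    hash: [u8; 32],
}

/// Reject the vote if the voter's hash for the last voted slot does not match ours.
fn check_vote(vote: &ToyVote, local_slot_hashes: &[(u64, [u8; 32])]) -> Result<(), VoteError> {
    let last_slot = *vote.slots.last().ok_or(VoteError::SlotNotFound)?;
    let (_, local_hash) = local_slot_hashes
        .iter()
        .find(|(slot, _)| *slot == last_slot)
        .ok_or(VoteError::SlotNotFound)?;
    if *local_hash != vote.hash {
        return Err(VoteError::SlotHashMismatch);
    }
    Ok(())
}

fn main() {
    let local = vec![(320803u64, [1u8; 32])];
    let divergent_vote = ToyVote { slots: vec![320803], hash: [2u8; 32] };
    assert_eq!(check_vote(&divergent_vote, &local), Err(VoteError::SlotHashMismatch));
}
```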
The ledgers from both groups appear to be the same, and when a ledger from the inconsistent bank hash group is run through solana-ledger-tool verify, the correct bank hash is produced. So there appears to be a runtime race condition that is causing the divergence in:
solana/runtime/src/bank.rs
Line 677 in c33b547
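Roughly speaking (an illustrative sketch only, under the assumption that the real computation at the bank.rs line above chains the parent hash with a digest of the slot's modified accounts), the bank hash folds the parent bank's hash together with the accounts written in this slot, so any race that changes which accounts were written, or to what values, changes the resulting hash:

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Illustrative only: chain the parent hash with the slot's modified accounts and a
/// signature count. Any difference in the modified-account set diverges the result.
fn toy_bank_hash(parent_hash: u64, modified_accounts: &[(&str, u64)], signature_count: u64) -> u64 {
    let mut hasher = DefaultHasher::new();
    parent_hash.hash(&mut hasher);
    for (pubkey, lamports) in modified_accounts {
        pubkey.hash(&mut hasher);
        lamports.hash(&mut hasher);
    }
    signature_count.hash(&mut hasher);
    hasher.finish()
}

fn main() {
    let parent = 0xdead_beef_u64;
    let a = toy_bank_hash(parent, &[("voter", 999)], 1);
    let b = toy_bank_hash(parent, &[("voter", 998)], 1); // one lamport off -> different hash
    assert_ne!(a, b);
}
```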