Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

bank hash mismatch causing dropped votes #7736

Closed
mvines opened this issue Jan 10, 2020 · 15 comments · Fixed by #7797
Closed

bank hash mismatch causing dropped votes #7736

mvines opened this issue Jan 10, 2020 · 15 comments · Fixed by #7797
Assignees

Comments

@mvines
Copy link
Member

mvines commented Jan 10, 2020

DR6 failed because more than 33% of the validators somehow generated an inconsistent bank hash which caused the remainder of the cluster to reject their votes:

warn!(
"{} dropped vote {:?} failed to match hash {} {}",
self.node_pubkey, vote, vote.hash, slot_hashes[j].1
);
return Err(VoteError::SlotHashMismatch);

The ledgers from both groups appear to be the same, and when a ledger from the inconsistent banh hash group is run through solana-ledger-tool verify the correct bank hash as produced. So there appears to be a runtime race condition that is causing

*hash = self.hash_internal_state();
to vary.

@mvines mvines added this to the Supertubes v0.22.3 milestone Jan 10, 2020
@mvines
Copy link
Member Author

mvines commented Jan 10, 2020

If this reproduces again, the additional logging at #7733 should help

@ryoqun
Copy link
Contributor

ryoqun commented Jan 10, 2020

Thanks for narrowing down into this!

Status report: I've carefully read suspicious relevant code with all of my knowledge employed. However no thing rings a bell so far... I'll try to repro it locally.

@ryoqun
Copy link
Contributor

ryoqun commented Jan 10, 2020

Phew, I've found something interesting. Stay tuned!

@ryoqun
Copy link
Contributor

ryoqun commented Jan 10, 2020

As always, I dunno this is really the culprit or immediate root cause of it..

It seems that the current run-time indeterministically managed to execute identical or (very similar) vote transactions on some validators and not on others. Thus, this could cause differing signature_count (and inconsistent bank hash) across the validators in the cluster. If I'm not very sure but, validator should keep record of failed TXes as part of ledger state, assuming repeated execution of identical TXes should be prevented for ledger integrity?

The failed to match hash errors has been observed several times before the complete collapse of the cluster; So it means the cluster can survive and it has some resistance to the error. Here, my theory is that even if such a diverging state should appear, the cluster's normal immunity would be able kill such a minority mutation. That's cool; However, it couldn't beat the last mighty mutant at around slot 320794. Maybe exaggerated by some cascading/avalanche effect?

First let's see a survived incident where the cluster survived (Although I'll quote actual validator's node key for clarity and further investigation, no blame for them as it's probabilistic nature; it can happen to anyone):

log from H4Dgb3KyCuYWKT8yKtp8qbY49cvaqZcisa2GDnroFsv6:
  [2020-01-09T05:15:04.205084664Z WARN  solana_vote_program::vote_state] GVthNzmh8UUJjnVqZNDKEjn43hPmHm7WRFw32GBoX6Qy dropped vote Vote { slots: [274788, 274789, 274791, 274792, 274793, 274794, 274795, 274796, 274797, 274798, 274799, 274800, 274841, 274842], hash: 3M9uzHb5GBXsA6WpaCe2s7Rgn54QyXmkowzAMkJACQ9q, timestamp: None } failed to match hash 3M9uzHb5GBXsA6WpaCe2s7Rgn54QyXmkowzAMkJACQ9q HVDmEsu3wWgrEDNKKYXT7BTyFbp9QAGQoaR9LkqfSmvc
! [2020-01-09T05:15:04.560712419Z WARN  solana_vote_program::vote_state] GVthNzmh8UUJjnVqZNDKEjn43hPmHm7WRFw32GBoX6Qy dropped vote Vote { slots: [274793, 274794, 274795, 274796, 274797, 274798, 274799, 274800, 274841, 274845], hash: 4zasPp8viBFHU32kP3j9h4fQpWq3VVk3f6brZnDF3MNH, timestamp: None } failed to match hash 4zasPp8viBFHU32kP3j9h4fQpWq3VVk3f6brZnDF3MNH B4AM9tMJK4TmH8a4pAR5EwKnvw7ag1DxfjGrcWfLswZG
! [2020-01-09T05:15:04.735320197Z WARN  solana_vote_program::vote_state] GVthNzmh8UUJjnVqZNDKEjn43hPmHm7WRFw32GBoX6Qy dropped vote Vote { slots: [274793, 274794, 274795, 274796, 274797, 274798, 274799, 274800, 274841, 274845], hash: 4zasPp8viBFHU32kP3j9h4fQpWq3VVk3f6brZnDF3MNH, timestamp: None } failed to match hash 4zasPp8viBFHU32kP3j9h4fQpWq3VVk3f6brZnDF3MNH B4AM9tMJK4TmH8a4pAR5EwKnvw7ag1DxfjGrcWfLswZG
  [2020-01-09T05:15:07.700188717Z WARN  solana_vote_program::vote_state] GVthNzmh8UUJjnVqZNDKEjn43hPmHm7WRFw32GBoX6Qy dropped vote Vote { slots: [274796, 274797, 274798, 274799, 274800, 274848], hash: C6AM8Y37LZrmfagNpBXrevh6AZzHEoHBwQF3yexAhS5p, timestamp: None } failed to match hash C6AM8Y37LZrmfagNpBXrevh6AZzHEoHBwQF3yexAhS5p A6fJwq1J8JYa9jXwXDQ2Db7DmsR7TWsMMnLdC1gx6j9Q

log from F7FgS6rrWckgC5X4cP5WtRRp3U1u12nnuTRXbWYaKn1u:
  [2020-01-09T05:15:07.468517435Z WARN  solana_vote_program::vote_state] GVthNzmh8UUJjnVqZNDKEjn43hPmHm7WRFw32GBoX6Qy dropped vote Vote { slots: [274788, 274789, 274791, 274792, 274793, 274794, 274795, 274796, 274797, 274798, 274799, 274800, 274841, 274842], hash: 3M9uzHb5GBXsA6WpaCe2s7Rgn54QyXmkowzAMkJACQ9q, timestamp: None } failed to match hash 3M9uzHb5GBXsA6WpaCe2s7Rgn54QyXmkowzAMkJACQ9q HVDmEsu3wWgrEDNKKYXT7BTyFbp9QAGQoaR9LkqfSmvc
! [2020-01-09T05:15:08.046465034Z WARN  solana_vote_program::vote_state] GVthNzmh8UUJjnVqZNDKEjn43hPmHm7WRFw32GBoX6Qy dropped vote Vote { slots: [274793, 274794, 274795, 274796, 274797, 274798, 274799, 274800, 274841, 274845], hash: 4zasPp8viBFHU32kP3j9h4fQpWq3VVk3f6brZnDF3MNH, timestamp: None } failed to match hash 4zasPp8viBFHU32kP3j9h4fQpWq3VVk3f6brZnDF3MNH B4AM9tMJK4TmH8a4pAR5EwKnvw7ag1DxfjGrcWfLswZG
  [2020-01-09T05:15:10.885943900Z WARN  solana_vote_program::vote_state] GVthNzmh8UUJjnVqZNDKEjn43hPmHm7WRFw32GBoX6Qy dropped vote Vote { slots: [274796, 274797, 274798, 274799, 274800, 274848], hash: C6AM8Y37LZrmfagNpBXrevh6AZzHEoHBwQF3yexAhS5p, timestamp: None } failed to match hash C6AM8Y37LZrmfagNpBXrevh6AZzHEoHBwQF3yexAhS5p A6fJwq1J8JYa9jXwXDQ2Db7DmsR7TWsMMnLdC1gx6j9Q

The !-prefixed lines indicates that some validator processed the vote TX once and the other did it twice for some mysterious reason. This is tiny excerpt of the incident. In the full course of the error, failed to match hash was only reported for the node pubkey GVthNzmh8UUJjnVqZNDKEjn43hPmHm7WRFw32GBoX6Qy. And the order of log entries are almost same regardless of logs and eventually the error burst stopped as if nothing happened.

Next the lethal incident:

log from H4Dgb3KyCuYWKT8yKtp8qbY49cvaqZcisa2GDnroFsv6:
!A [2020-01-09T11:57:11.767497926Z WARN  solana_vote_program::vote_state] 47UuTGPAQZX2HnVcfxKk8b1BtA4rRTduVaHnvxzQe6AJ dropped vote Vote { slots: [320794], hash: HfChLShizVffa4beRFtqU1YB7i2AzuqBq46ju5s7iXiR, timestamp: None } failed to match hash HfChLShizVffa4beRFtqU1YB7i2AzuqBq46ju5s7iXiR ANCfD4gmtdGkvYuPnJqYq4j1aGSMxNC67To3Bb2tkYAD
!B [2020-01-09T11:57:11.868219515Z WARN  solana_vote_program::vote_state] 47UuTGPAQZX2HnVcfxKk8b1BtA4rRTduVaHnvxzQe6AJ dropped vote Vote { slots: [320794, 320795], hash: E9KNqBXYaxoMGUUq5CoEiFFdJYAwDumx49Ze7No2BEbT, timestamp: None } failed to match hash E9KNqBXYaxoMGUUq5CoEiFFdJYAwDumx49Ze7No2BEbT AUDKk5DfCEYHKswaA9gmuSQeMw8auKx7wyPoWbTEToWN
!B [2020-01-09T11:57:17.028713478Z WARN  solana_vote_program::vote_state] 47UuTGPAQZX2HnVcfxKk8b1BtA4rRTduVaHnvxzQe6AJ dropped vote Vote { slots: [320794, 320795], hash: E9KNqBXYaxoMGUUq5CoEiFFdJYAwDumx49Ze7No2BEbT, timestamp: None } failed to match hash E9KNqBXYaxoMGUUq5CoEiFFdJYAwDumx49Ze7No2BEbT AUDKk5DfCEYHKswaA9gmuSQeMw8auKx7wyPoWbTEToWN
!A [2020-01-09T11:57:17.056845070Z WARN  solana_vote_program::vote_state] 47UuTGPAQZX2HnVcfxKk8b1BtA4rRTduVaHnvxzQe6AJ dropped vote Vote { slots: [320794], hash: HfChLShizVffa4beRFtqU1YB7i2AzuqBq46ju5s7iXiR, timestamp: None } failed to match hash HfChLShizVffa4beRFtqU1YB7i2AzuqBq46ju5s7iXiR ANCfD4gmtdGkvYuPnJqYq4j1aGSMxNC67To3Bb2tkYAD
   [2020-01-09T11:57:18.759706697Z WARN  solana_vote_program::vote_state] boot1Z6jb15CLqpaMTn2CxktktwZpRAVAgHZEW6SxQ7 dropped vote Vote { slots: [320796], hash: 9VE77ZuWKDcvjMXz6zGaZj2pqfKzQv2eM6ArxDhaZgAY, timestamp: None } failed to match hash 9VE77ZuWKDcvjMXz6zGaZj2pqfKzQv2eM6ArxDhaZgAY FPV4FNvs4FnkDJxEtP5ULP5jrQHStBzASkmnq8HnSN7t
   [2020-01-09T11:57:19.147150431Z WARN  solana_vote_program::vote_state] boot1Z6jb15CLqpaMTn2CxktktwZpRAVAgHZEW6SxQ7 dropped vote Vote { slots: [320797], hash: GXMfBvVt879CcVcDzyFL9fG6RmnoRCX2pypp8BBiRPqX, timestamp: None } failed to match hash GXMfBvVt879CcVcDzyFL9fG6RmnoRCX2pypp8BBiRPqX 7JCEYA1ZqC23LS2Dtg4RkaxL9XjAUUkezfKCufvgQQPq
   [2020-01-09T11:57:19.415032765Z WARN  solana_vote_program::vote_state] boot1Z6jb15CLqpaMTn2CxktktwZpRAVAgHZEW6SxQ7 dropped vote Vote { slots: [320798], hash: EeiQ4ccromw2jiCZUsxRhbMoXwnrZrX8nVpt82quLpzq, timestamp: None } failed to match hash EeiQ4ccromw2jiCZUsxRhbMoXwnrZrX8nVpt82quLpzq 7NgQK2X1tQzsLWvVDcGePUv5FjXVXVzYCALnWqWbJubJ
   [2020-01-09T11:57:19.439106878Z WARN  solana_vote_program::vote_state] 8oRw7qpj6XgLGXYCDuNoTMCqoJnDd6A8LTpNyqApSfkA dropped vote Vote { slots: [320796], hash: 9VE77ZuWKDcvjMXz6zGaZj2pqfKzQv2eM6ArxDhaZgAY, timestamp: None } failed to match hash 9VE77ZuWKDcvjMXz6zGaZj2pqfKzQv2eM6ArxDhaZgAY FPV4FNvs4FnkDJxEtP5ULP5jrQHStBzASkmnq8HnSN7t
   [2020-01-09T11:57:19.474036831Z WARN  solana_vote_program::vote_state] D71JRzjPpHipt8NAWnWb3yZoXezbkGXqSf7TVCir6wvT dropped vote Vote { slots: [320796], hash: 9VE77ZuWKDcvjMXz6zGaZj2pqfKzQv2eM6ArxDhaZgAY, timestamp: None } failed to match hash 9VE77ZuWKDcvjMXz6zGaZj2pqfKzQv2eM6ArxDhaZgAY FPV4FNvs4FnkDJxEtP5ULP5jrQHStBzASkmnq8HnSN7t
   [2020-01-09T11:57:19.474242632Z WARN  solana_vote_program::vote_state] 5NH47Zk9NAzfbtqNpUtn8CQgNZeZE88aa2NRpfe7DyTD dropped vote Vote { slots: [320796], hash: 9VE77ZuWKDcvjMXz6zGaZj2pqfKzQv2eM6ArxDhaZgAY, timestamp: None } failed to match hash 9VE77ZuWKDcvjMXz6zGaZj2pqfKzQv2eM6ArxDhaZgAY FPV4FNvs4FnkDJxEtP5ULP5jrQHStBzASkmnq8HnSN7t
   [2020-01-09T11:57:19.479717878Z WARN  solana_vote_program::vote_state] 55nmQ8gdWpNW5tLPoBPsqDkLm1W24cmY5DbMMXZKSP8U dropped vote Vote { slots: [320796], hash: 9VE77ZuWKDcvjMXz6zGaZj2pqfKzQv2eM6ArxDhaZgAY, timestamp: None } failed to match hash 9VE77ZuWKDcvjMXz6zGaZj2pqfKzQv2eM6ArxDhaZgAY FPV4FNvs4FnkDJxEtP5ULP5jrQHStBzASkmnq8HnSN7t
   [2020-01-09T11:57:19.534473871Z WARN  solana_vote_program::vote_state] 9uEazQxpRTyYX1hHMMySDBWad7zb54K9PkHKeZemK2m7 dropped vote Vote { slots: [320796], hash: 9VE77ZuWKDcvjMXz6zGaZj2pqfKzQv2eM6ArxDhaZgAY, timestamp: None } failed to match hash 9VE77ZuWKDcvjMXz6zGaZj2pqfKzQv2eM6ArxDhaZgAY FPV4FNvs4FnkDJxEtP5ULP5jrQHStBzASkmnq8HnSN7t
   [2020-01-09T11:57:19.585981995Z WARN  solana_vote_program::vote_state] 4XWxphAh1Ji9p3dYMNRNtW3sbmr5Z1cvsGyJXJx5Jvfy dropped vote Vote { slots: [320796], hash: 9VE77ZuWKDcvjMXz6zGaZj2pqfKzQv2eM6ArxDhaZgAY, timestamp: None } failed to match hash 9VE77ZuWKDcvjMXz6zGaZj2pqfKzQv2eM6ArxDhaZgAY FPV4FNvs4FnkDJxEtP5ULP5jrQHStBzASkmnq8HnSN7t
   [2020-01-09T11:57:19.586563852Z WARN  solana_vote_program::vote_state] 2DvsPbbKrBaJm7SbdVvRjZL1NGCU3MwciGCoCw42fTMu dropped vote Vote { slots: [320796], hash: 9VE77ZuWKDcvjMXz6zGaZj2pqfKzQv2eM6ArxDhaZgAY, timestamp: None } failed to match hash 9VE77ZuWKDcvjMXz6zGaZj2pqfKzQv2eM6ArxDhaZgAY FPV4FNvs4FnkDJxEtP5ULP5jrQHStBzASkmnq8HnSN7t
   [2020-01-09T11:57:19.587217721Z WARN  solana_vote_program::vote_state] 47UuTGPAQZX2HnVcfxKk8b1BtA4rRTduVaHnvxzQe6AJ dropped vote Vote { slots: [320796], hash: 8oWuB7N74wPbZKKwbBiPLCTU8KuLnWgL6hqVy9NdEc8n, timestamp: None } failed to match hash 8oWuB7N74wPbZKKwbBiPLCTU8KuLnWgL6hqVy9NdEc8n FPV4FNvs4FnkDJxEtP5ULP5jrQHStBzASkmnq8HnSN7t

log from F7FgS6rrWckgC5X4cP5WtRRp3U1u12nnuTRXbWYaKn1u:
!A [2020-01-09T11:57:11.963644027Z WARN  solana_vote_program::vote_state] 47UuTGPAQZX2HnVcfxKk8b1BtA4rRTduVaHnvxzQe6AJ dropped vote Vote { slots: [320794], hash: HfChLShizVffa4beRFtqU1YB7i2AzuqBq46ju5s7iXiR, timestamp: None } failed to match hash HfChLShizVffa4beRFtqU1YB7i2AzuqBq46ju5s7iXiR ANCfD4gmtdGkvYuPnJqYq4j1aGSMxNC67To3Bb2tkYAD
!B [2020-01-09T11:57:12.005661857Z WARN  solana_vote_program::vote_state] 47UuTGPAQZX2HnVcfxKk8b1BtA4rRTduVaHnvxzQe6AJ dropped vote Vote { slots: [320794, 320795], hash: E9KNqBXYaxoMGUUq5CoEiFFdJYAwDumx49Ze7No2BEbT, timestamp: None } failed to match hash E9KNqBXYaxoMGUUq5CoEiFFdJYAwDumx49Ze7No2BEbT AUDKk5DfCEYHKswaA9gmuSQeMw8auKx7wyPoWbTEToWN
   [2020-01-09T11:57:15.872688385Z WARN  solana_vote_program::vote_state] boot1Z6jb15CLqpaMTn2CxktktwZpRAVAgHZEW6SxQ7 dropped vote Vote { slots: [320796], hash: 9VE77ZuWKDcvjMXz6zGaZj2pqfKzQv2eM6ArxDhaZgAY, timestamp: None } failed to match hash 9VE77ZuWKDcvjMXz6zGaZj2pqfKzQv2eM6ArxDhaZgAY 5qouLcHzt8FWpwKiXm8mJc5gn4GDzG5gevyJiJ7NB3zx
!B [2020-01-09T11:57:17.132964870Z WARN  solana_vote_program::vote_state] 47UuTGPAQZX2HnVcfxKk8b1BtA4rRTduVaHnvxzQe6AJ dropped vote Vote { slots: [320794, 320795], hash: E9KNqBXYaxoMGUUq5CoEiFFdJYAwDumx49Ze7No2BEbT, timestamp: None } failed to match hash E9KNqBXYaxoMGUUq5CoEiFFdJYAwDumx49Ze7No2BEbT AUDKk5DfCEYHKswaA9gmuSQeMw8auKx7wyPoWbTEToWN
!A [2020-01-09T11:57:17.189839200Z WARN  solana_vote_program::vote_state] 47UuTGPAQZX2HnVcfxKk8b1BtA4rRTduVaHnvxzQe6AJ dropped vote Vote { slots: [320794], hash: HfChLShizVffa4beRFtqU1YB7i2AzuqBq46ju5s7iXiR, timestamp: None } failed to match hash HfChLShizVffa4beRFtqU1YB7i2AzuqBq46ju5s7iXiR ANCfD4gmtdGkvYuPnJqYq4j1aGSMxNC67To3Bb2tkYAD
   [2020-01-09T11:57:18.017377593Z WARN  solana_vote_program::vote_state] boot1Z6jb15CLqpaMTn2CxktktwZpRAVAgHZEW6SxQ7 dropped vote Vote { slots: [320797], hash: GXMfBvVt879CcVcDzyFL9fG6RmnoRCX2pypp8BBiRPqX, timestamp: None } failed to match hash GXMfBvVt879CcVcDzyFL9fG6RmnoRCX2pypp8BBiRPqX B3gRZSNcD6g4fGzUQkQfwVhMAiVMBGaVGXQcdy9ZV13o
   [2020-01-09T11:57:18.541420835Z WARN  solana_vote_program::vote_state] boot1Z6jb15CLqpaMTn2CxktktwZpRAVAgHZEW6SxQ7 dropped vote Vote { slots: [320798], hash: EeiQ4ccromw2jiCZUsxRhbMoXwnrZrX8nVpt82quLpzq, timestamp: None } failed to match hash EeiQ4ccromw2jiCZUsxRhbMoXwnrZrX8nVpt82quLpzq 4omD3ymsGo4WtYfTF3FnemrF2uESZD5kU8BTQKaN5aG7
   [2020-01-09T11:57:18.569238287Z WARN  solana_vote_program::vote_state] 5NH47Zk9NAzfbtqNpUtn8CQgNZeZE88aa2NRpfe7DyTD dropped vote Vote { slots: [320796], hash: 9VE77ZuWKDcvjMXz6zGaZj2pqfKzQv2eM6ArxDhaZgAY, timestamp: None } failed to match hash 9VE77ZuWKDcvjMXz6zGaZj2pqfKzQv2eM6ArxDhaZgAY 5qouLcHzt8FWpwKiXm8mJc5gn4GDzG5gevyJiJ7NB3zx
   [2020-01-09T11:57:18.569378604Z WARN  solana_vote_program::vote_state] D71JRzjPpHipt8NAWnWb3yZoXezbkGXqSf7TVCir6wvT dropped vote Vote { slots: [320796], hash: 9VE77ZuWKDcvjMXz6zGaZj2pqfKzQv2eM6ArxDhaZgAY, timestamp: None } failed to match hash 9VE77ZuWKDcvjMXz6zGaZj2pqfKzQv2eM6ArxDhaZgAY 5qouLcHzt8FWpwKiXm8mJc5gn4GDzG5gevyJiJ7NB3zx
   [2020-01-09T11:57:18.576298575Z WARN  solana_vote_program::vote_state] 8oRw7qpj6XgLGXYCDuNoTMCqoJnDd6A8LTpNyqApSfkA dropped vote Vote { slots: [320796], hash: 9VE77ZuWKDcvjMXz6zGaZj2pqfKzQv2eM6ArxDhaZgAY, timestamp: None } failed to match hash 9VE77ZuWKDcvjMXz6zGaZj2pqfKzQv2eM6ArxDhaZgAY 5qouLcHzt8FWpwKiXm8mJc5gn4GDzG5gevyJiJ7NB3zx
   [2020-01-09T11:57:18.580355175Z WARN  solana_vote_program::vote_state] 55nmQ8gdWpNW5tLPoBPsqDkLm1W24cmY5DbMMXZKSP8U dropped vote Vote { slots: [320796], hash: 9VE77ZuWKDcvjMXz6zGaZj2pqfKzQv2eM6ArxDhaZgAY, timestamp: None } failed to match hash 9VE77ZuWKDcvjMXz6zGaZj2pqfKzQv2eM6ArxDhaZgAY 5qouLcHzt8FWpwKiXm8mJc5gn4GDzG5gevyJiJ7NB3zx
   [2020-01-09T11:57:18.631536851Z WARN  solana_vote_program::vote_state] 9uEazQxpRTyYX1hHMMySDBWad7zb54K9PkHKeZemK2m7 dropped vote Vote { slots: [320796], hash: 9VE77ZuWKDcvjMXz6zGaZj2pqfKzQv2eM6ArxDhaZgAY, timestamp: None } failed to match hash 9VE77ZuWKDcvjMXz6zGaZj2pqfKzQv2eM6ArxDhaZgAY 5qouLcHzt8FWpwKiXm8mJc5gn4GDzG5gevyJiJ7NB3zx
   [2020-01-09T11:57:18.651061441Z WARN  solana_vote_program::vote_state] 47UuTGPAQZX2HnVcfxKk8b1BtA4rRTduVaHnvxzQe6AJ dropped vote Vote { slots: [320796], hash: 8oWuB7N74wPbZKKwbBiPLCTU8KuLnWgL6hqVy9NdEc8n, timestamp: None } failed to match hash 8oWuB7N74wPbZKKwbBiPLCTU8KuLnWgL6hqVy9NdEc8n 5qouLcHzt8FWpwKiXm8mJc5gn4GDzG5gevyJiJ7NB3zx
   [2020-01-09T11:57:18.652338230Z WARN  solana_vote_program::vote_state] 4XWxphAh1Ji9p3dYMNRNtW3sbmr5Z1cvsGyJXJx5Jvfy dropped vote Vote { slots: [320796], hash: 9VE77ZuWKDcvjMXz6zGaZj2pqfKzQv2eM6ArxDhaZgAY, timestamp: None } failed to match hash 9VE77ZuWKDcvjMXz6zGaZj2pqfKzQv2eM6ArxDhaZgAY 5qouLcHzt8FWpwKiXm8mJc5gn4GDzG5gevyJiJ7NB3zx
   [2020-01-09T11:57:18.653442597Z WARN  solana_vote_program::vote_state] 2DvsPbbKrBaJm7SbdVvRjZL1NGCU3MwciGCoCw42fTMu dropped vote Vote { slots: [320796], hash: 9VE77ZuWKDcvjMXz6zGaZj2pqfKzQv2eM6ArxDhaZgAY, timestamp: None } failed to match hash 9VE77ZuWKDcvjMXz6zGaZj2pqfKzQv2eM6ArxDhaZgAY 5qouLcHzt8FWpwKiXm8mJc5gn4GDzG5gevyJiJ7NB3zx

In this case, after duplicate execution of similar vote TX, various validators started to vote with differing hashes, ultimately leading to the cluster death.

@mvines
Copy link
Member Author

mvines commented Jan 10, 2020

Another point of note is that when the ledger from those errant nodes is replayed using solana-ledger-tool it accepts the bank hashes for the majority and rejects its own votes, implying the ledger data itself is in fact correct and consistent across all machines

@ryoqun
Copy link
Contributor

ryoqun commented Jan 10, 2020

I'm running out of gas for this week...

@mvines Yeah, that should be kept in mind!

As far as I've glanced the codebase, no quick success for hunting the root cause.

I'm suspecting around StatusCache and FeeCalculator/Collector. Because these failed Vote TXes are is_commit() == true, StatusCache and FeeCalculator/Collector are doing their bussiness for each failed TX and might cause state differences depending on the number of processing or the order of it only in runtime not in the ledger while a validator running.

@ryoqun
Copy link
Contributor

ryoqun commented Jan 10, 2020

Another approach to the investigation in my mind is that since we have plenty of logs, we can create a graph of how the vote hash discrepancy is spread over the incident time in the cluster by graphviz with relatively less effort. Combined with corresponding leader schedule, this might shed some light....

@ryoqun
Copy link
Contributor

ryoqun commented Jan 10, 2020

Oh, the last one, just see what the situation is in the ledger when those error happened at that times.

@sakridge
Copy link
Member

I took the F7F validator log case:

The first one A seems to be 320738.

new fork:320738 parent:320721 root:320669

Second one seems to be on fork: 320803

new fork:320803 parent:320795 root:320679

I think it's likely it's just the same vote applied on two different forks.

@ryoqun
Copy link
Contributor

ryoqun commented Jan 11, 2020

@sakridge Thanks for checking out! So, it seems that these log patterns are harmless.

Another suspicious error is:

log from 4Bx5bzjmPrU1g74AHfYpTMXvspBt8GnvZVQW3ba9z4Af:
[2020-01-09T11:57:09.623904040Z INFO  solana_core::repair_service] 4Bx5bzjmPrU1g74AHfYpTMXvspBt8GnvZVQW3ba9z4Af repair req send_to(XXX.XXX.XXX.XXX:XXX) error Os { code: 11, kind: WouldBlock, message: "Resource temporarily unavailable" }
[2020-01-09T11:57:09.623915739Z INFO  solana_core::repair_service] 4Bx5bzjmPrU1g74AHfYpTMXvspBt8GnvZVQW3ba9z4Af repair req send_to(XXX.XXX.XXX.XXX:XXX) error Os { code: 11, kind: WouldBlock, message: "Resource temporarily unavailable" }
[2020-01-09T11:57:09.623924490Z INFO  solana_core::repair_service] 4Bx5bzjmPrU1g74AHfYpTMXvspBt8GnvZVQW3ba9z4Af repair req send_to(XXX.XXX.XXX.XXX:XXX) error Os { code: 11, kind: WouldBlock, message: "Resource temporarily unavailable" }

This occurs a lot only on 4Bx5bzjmPrU1g74AHfYpTMXvspBt8GnvZVQW3ba9z4Af; and it seems there is no corresponding error at the other side validator. Does the requesting validator can really correctly handle error?

@ryoqun
Copy link
Contributor

ryoqun commented Jan 11, 2020

Another logs from validators around the deadly event

log from H4Dgb3KyCuYWKT8yKtp8qbY49cvaqZcisa2GDnroFsv6
[2020-01-09T11:57:11.767497926Z WARN  solana_vote_program::vote_state] 47UuTGPAQZX2HnVcfxKk8b1BtA4rRTduVaHnvxzQe6AJ dropped vote Vote { slots: [320794], hash: HfChLShizVffa4beRFtqU1YB7i2AzuqBq46ju5s7iXiR, timestamp: None } failed to match hash HfChLShizVffa4beRFtqU1YB7i2AzuqBq46ju5s7iXiR ANCfD4gmtdGkvYuPnJqYq4j1aGSMxNC67To3Bb2tkYAD
[2020-01-09T11:57:11.868219515Z WARN  solana_vote_program::vote_state] 47UuTGPAQZX2HnVcfxKk8b1BtA4rRTduVaHnvxzQe6AJ dropped vote Vote { slots: [320794, 320795], hash: E9KNqBXYaxoMGUUq5CoEiFFdJYAwDumx49Ze7No2BEbT, timestamp: None } failed to match hash E9KNqBXYaxoMGUUq5CoEiFFdJYAwDumx49Ze7No2BEbT AUDKk5DfCEYHKswaA9gmuSQeMw8auKx7wyPoWbTEToWN
[2020-01-09T11:57:17.028713478Z WARN  solana_vote_program::vote_state] 47UuTGPAQZX2HnVcfxKk8b1BtA4rRTduVaHnvxzQe6AJ dropped vote Vote { slots: [320794, 320795], hash: E9KNqBXYaxoMGUUq5CoEiFFdJYAwDumx49Ze7No2BEbT, timestamp: None } failed to match hash E9KNqBXYaxoMGUUq5CoEiFFdJYAwDumx49Ze7No2BEbT AUDKk5DfCEYHKswaA9gmuSQeMw8auKx7wyPoWbTEToWN
[2020-01-09T11:57:17.056845070Z WARN  solana_vote_program::vote_state] 47UuTGPAQZX2HnVcfxKk8b1BtA4rRTduVaHnvxzQe6AJ dropped vote Vote { slots: [320794], hash: HfChLShizVffa4beRFtqU1YB7i2AzuqBq46ju5s7iXiR, timestamp: None } failed to match hash HfChLShizVffa4beRFtqU1YB7i2AzuqBq46ju5s7iXiR ANCfD4gmtdGkvYuPnJqYq4j1aGSMxNC67To3Bb2tkYAD
[2020-01-09T11:57:18.759706697Z WARN  solana_vote_program::vote_state] boot1Z6jb15CLqpaMTn2CxktktwZpRAVAgHZEW6SxQ7 dropped vote Vote { slots: [320796], hash: 9VE77ZuWKDcvjMXz6zGaZj2pqfKzQv2eM6ArxDhaZgAY, timestamp: None } failed to match hash 9VE77ZuWKDcvjMXz6zGaZj2pqfKzQv2eM6ArxDhaZgAY FPV4FNvs4FnkDJxEtP5ULP5jrQHStBzASkmnq8HnSN7t
[2020-01-09T11:57:19.147150431Z WARN  solana_vote_program::vote_state] boot1Z6jb15CLqpaMTn2CxktktwZpRAVAgHZEW6SxQ7 dropped vote Vote { slots: [320797], hash: GXMfBvVt879CcVcDzyFL9fG6RmnoRCX2pypp8BBiRPqX, timestamp: None } failed to match hash GXMfBvVt879CcVcDzyFL9fG6RmnoRCX2pypp8BBiRPqX 7JCEYA1ZqC23LS2Dtg4RkaxL9XjAUUkezfKCufvgQQPq
[2020-01-09T11:57:19.415032765Z WARN  solana_vote_program::vote_state] boot1Z6jb15CLqpaMTn2CxktktwZpRAVAgHZEW6SxQ7 dropped vote Vote { slots: [320798], hash: EeiQ4ccromw2jiCZUsxRhbMoXwnrZrX8nVpt82quLpzq, timestamp: None } failed to match hash EeiQ4ccromw2jiCZUsxRhbMoXwnrZrX8nVpt82quLpzq 7NgQK2X1tQzsLWvVDcGePUv5FjXVXVzYCALnWqWbJubJ
[2020-01-09T11:57:19.439106878Z WARN  solana_vote_program::vote_state] 8oRw7qpj6XgLGXYCDuNoTMCqoJnDd6A8LTpNyqApSfkA dropped vote Vote { slots: [320796], hash: 9VE77ZuWKDcvjMXz6zGaZj2pqfKzQv2eM6ArxDhaZgAY, timestamp: None } failed to match hash 9VE77ZuWKDcvjMXz6zGaZj2pqfKzQv2eM6ArxDhaZgAY FPV4FNvs4FnkDJxEtP5ULP5jrQHStBzASkmnq8HnSN7t
[2020-01-09T11:57:19.474036831Z WARN  solana_vote_program::vote_state] D71JRzjPpHipt8NAWnWb3yZoXezbkGXqSf7TVCir6wvT dropped vote Vote { slots: [320796], hash: 9VE77ZuWKDcvjMXz6zGaZj2pqfKzQv2eM6ArxDhaZgAY, timestamp: None } failed to match hash 9VE77ZuWKDcvjMXz6zGaZj2pqfKzQv2eM6ArxDhaZgAY FPV4FNvs4FnkDJxEtP5ULP5jrQHStBzASkmnq8HnSN7t
[2020-01-09T11:57:19.474242632Z WARN  solana_vote_program::vote_state] 5NH47Zk9NAzfbtqNpUtn8CQgNZeZE88aa2NRpfe7DyTD dropped vote Vote { slots: [320796], hash: 9VE77ZuWKDcvjMXz6zGaZj2pqfKzQv2eM6ArxDhaZgAY, timestamp: None } failed to match hash 9VE77ZuWKDcvjMXz6zGaZj2pqfKzQv2eM6ArxDhaZgAY FPV4FNvs4FnkDJxEtP5ULP5jrQHStBzASkmnq8HnSN7t
[2020-01-09T11:57:19.479717878Z WARN  solana_vote_program::vote_state] 55nmQ8gdWpNW5tLPoBPsqDkLm1W24cmY5DbMMXZKSP8U dropped vote Vote { slots: [320796], hash: 9VE77ZuWKDcvjMXz6zGaZj2pqfKzQv2eM6ArxDhaZgAY, timestamp: None } failed to match hash 9VE77ZuWKDcvjMXz6zGaZj2pqfKzQv2eM6ArxDhaZgAY FPV4FNvs4FnkDJxEtP5ULP5jrQHStBzASkmnq8HnSN7t
[2020-01-09T11:57:19.534473871Z WARN  solana_vote_program::vote_state] 9uEazQxpRTyYX1hHMMySDBWad7zb54K9PkHKeZemK2m7 dropped vote Vote { slots: [320796], hash: 9VE77ZuWKDcvjMXz6zGaZj2pqfKzQv2eM6ArxDhaZgAY, timestamp: None } failed to match hash 9VE77ZuWKDcvjMXz6zGaZj2pqfKzQv2eM6ArxDhaZgAY FPV4FNvs4FnkDJxEtP5ULP5jrQHStBzASkmnq8HnSN7t
[2020-01-09T11:57:19.585981995Z WARN  solana_vote_program::vote_state] 4XWxphAh1Ji9p3dYMNRNtW3sbmr5Z1cvsGyJXJx5Jvfy dropped vote Vote { slots: [320796], hash: 9VE77ZuWKDcvjMXz6zGaZj2pqfKzQv2eM6ArxDhaZgAY, timestamp: None } failed to match hash 9VE77ZuWKDcvjMXz6zGaZj2pqfKzQv2eM6ArxDhaZgAY FPV4FNvs4FnkDJxEtP5ULP5jrQHStBzASkmnq8HnSN7t
[2020-01-09T11:57:19.586563852Z WARN  solana_vote_program::vote_state] 2DvsPbbKrBaJm7SbdVvRjZL1NGCU3MwciGCoCw42fTMu dropped vote Vote { slots: [320796], hash: 9VE77ZuWKDcvjMXz6zGaZj2pqfKzQv2eM6ArxDhaZgAY, timestamp: None } failed to match hash 9VE77ZuWKDcvjMXz6zGaZj2pqfKzQv2eM6ArxDhaZgAY FPV4FNvs4FnkDJxEtP5ULP5jrQHStBzASkmnq8HnSN7t
log from BMDE5Q2fjCmbAcig1Nxkh8pxDRvoayLGXYm3mbxiNev
[2020-01-09T11:57:11.349592005Z WARN  solana_vote_program::vote_state] 47UuTGPAQZX2HnVcfxKk8b1BtA4rRTduVaHnvxzQe6AJ dropped vote Vote { slots: [320794], hash: HfChLShizVffa4beRFtqU1YB7i2AzuqBq46ju5s7iXiR, timestamp: None } failed to match hash HfChLShizVffa4beRFtqU1YB7i2AzuqBq46ju5s7iXiR ANCfD4gmtdGkvYuPnJqYq4j1aGSMxNC67To3Bb2tkYAD
[2020-01-09T11:57:11.452600826Z WARN  solana_vote_program::vote_state] 47UuTGPAQZX2HnVcfxKk8b1BtA4rRTduVaHnvxzQe6AJ dropped vote Vote { slots: [320794, 320795], hash: E9KNqBXYaxoMGUUq5CoEiFFdJYAwDumx49Ze7No2BEbT, timestamp: None } failed to match hash E9KNqBXYaxoMGUUq5CoEiFFdJYAwDumx49Ze7No2BEbT AUDKk5DfCEYHKswaA9gmuSQeMw8auKx7wyPoWbTEToWN
[2020-01-09T11:57:15.781688866Z WARN  solana_vote_program::vote_state] 47UuTGPAQZX2HnVcfxKk8b1BtA4rRTduVaHnvxzQe6AJ dropped vote Vote { slots: [320794, 320795], hash: E9KNqBXYaxoMGUUq5CoEiFFdJYAwDumx49Ze7No2BEbT, timestamp: None } failed to match hash E9KNqBXYaxoMGUUq5CoEiFFdJYAwDumx49Ze7No2BEbT AUDKk5DfCEYHKswaA9gmuSQeMw8auKx7wyPoWbTEToWN
[2020-01-09T11:57:15.917492328Z WARN  solana_vote_program::vote_state] 47UuTGPAQZX2HnVcfxKk8b1BtA4rRTduVaHnvxzQe6AJ dropped vote Vote { slots: [320794], hash: HfChLShizVffa4beRFtqU1YB7i2AzuqBq46ju5s7iXiR, timestamp: None } failed to match hash HfChLShizVffa4beRFtqU1YB7i2AzuqBq46ju5s7iXiR ANCfD4gmtdGkvYuPnJqYq4j1aGSMxNC67To3Bb2tkYAD
[2020-01-09T11:57:17.314269734Z WARN  solana_vote_program::vote_state] 47UuTGPAQZX2HnVcfxKk8b1BtA4rRTduVaHnvxzQe6AJ dropped vote Vote { slots: [320796], hash: 8oWuB7N74wPbZKKwbBiPLCTU8KuLnWgL6hqVy9NdEc8n, timestamp: None } failed to match hash 8oWuB7N74wPbZKKwbBiPLCTU8KuLnWgL6hqVy9NdEc8n 9VE77ZuWKDcvjMXz6zGaZj2pqfKzQv2eM6ArxDhaZgAY
[2020-01-09T11:57:17.606565290Z WARN  solana_vote_program::vote_state] 4Bx5bzjmPrU1g74AHfYpTMXvspBt8GnvZVQW3ba9z4Af dropped vote Vote { slots: [320796, 320797, 320798], hash: 7Lw4tE5tgpiA7wAV8wfjFKnNzi4aYnzB5ReotBm3C93K, timestamp: None } failed to match hash 7Lw4tE5tgpiA7wAV8wfjFKnNzi4aYnzB5ReotBm3C93K EeiQ4ccromw2jiCZUsxRhbMoXwnrZrX8nVpt82quLpzq
[2020-01-09T11:57:17.607383493Z WARN  solana_vote_program::vote_state] 47UuTGPAQZX2HnVcfxKk8b1BtA4rRTduVaHnvxzQe6AJ dropped vote Vote { slots: [320796, 320797], hash: 3aDTq4Qn4cVzV9G3YiB6ZTCkMFA7vMJE5AyrHDtkaNe9, timestamp: None } failed to match hash 3aDTq4Qn4cVzV9G3YiB6ZTCkMFA7vMJE5AyrHDtkaNe9 GXMfBvVt879CcVcDzyFL9fG6RmnoRCX2pypp8BBiRPqX
[2020-01-09T11:57:17.674824826Z WARN  solana_vote_program::vote_state] 4Bx5bzjmPrU1g74AHfYpTMXvspBt8GnvZVQW3ba9z4Af dropped vote Vote { slots: [320796, 320797], hash: 51CADCFic2Rpwg3TDKPUnQshCJK9v52QeT2mng9Xacb9, timestamp: None } failed to match hash 51CADCFic2Rpwg3TDKPUnQshCJK9v52QeT2mng9Xacb9 GXMfBvVt879CcVcDzyFL9fG6RmnoRCX2pypp8BBiRPqX
[2020-01-09T11:57:17.901442932Z WARN  solana_vote_program::vote_state] 4Bx5bzjmPrU1g74AHfYpTMXvspBt8GnvZVQW3ba9z4Af dropped vote Vote { slots: [320796, 320797, 320798, 320799], hash: DZ4SK4f1KrH86rRyAM7SZyUib4kwu42WsUekjv61S2Qz, timestamp: None } failed to match hash DZ4SK4f1KrH86rRyAM7SZyUib4kwu42WsUekjv61S2Qz 8ze7LPqL1gnVC4v9kkumiNKC5qYagrb9uq51HUPpCkFv
[2020-01-09T11:57:18.318380782Z WARN  solana_vote_program::vote_state] 47UuTGPAQZX2HnVcfxKk8b1BtA4rRTduVaHnvxzQe6AJ dropped vote Vote { slots: [320796, 320797, 320798], hash: 5tcR8W4epCp9Yo9vdXt9ZMuU9ibCsDKtFyoA3cVXfkyJ, timestamp: None } failed to match hash 5tcR8W4epCp9Yo9vdXt9ZMuU9ibCsDKtFyoA3cVXfkyJ EeiQ4ccromw2jiCZUsxRhbMoXwnrZrX8nVpt82quLpzq
[2020-01-09T11:57:18.828936897Z WARN  solana_vote_program::vote_state] F7FgS6rrWckgC5X4cP5WtRRp3U1u12nnuTRXbWYaKn1u dropped vote Vote { slots: [320796], hash: 5qouLcHzt8FWpwKiXm8mJc5gn4GDzG5gevyJiJ7NB3zx, timestamp: None } failed to match hash 5qouLcHzt8FWpwKiXm8mJc5gn4GDzG5gevyJiJ7NB3zx 9VE77ZuWKDcvjMXz6zGaZj2pqfKzQv2eM6ArxDhaZgAY
[2020-01-09T11:57:18.834951144Z WARN  solana_vote_program::vote_state] 47UuTGPAQZX2HnVcfxKk8b1BtA4rRTduVaHnvxzQe6AJ dropped vote Vote { slots: [320797, 320798, 320799], hash: H1YnfNmsnR3Wbxvf8sTRuMm6EPaRzTBgftJ2Gi3hogDr, timestamp: None } failed to match hash H1YnfNmsnR3Wbxvf8sTRuMm6EPaRzTBgftJ2Gi3hogDr 8ze7LPqL1gnVC4v9kkumiNKC5qYagrb9uq51HUPpCkFv
[2020-01-09T11:57:19.620958991Z WARN  solana_vote_program::vote_state] 3Z5XVczCTXeYeFABoeFm1LngC9657kZMVGNFzqFXviHb dropped vote Vote { slots: [320796], hash: BcHruyKFfJFh2SbcTuvQM2s4iFXhDWkbnXfaCEAmf2X8, timestamp: None } failed to match hash BcHruyKFfJFh2SbcTuvQM2s4iFXhDWkbnXfaCEAmf2X8 9VE77ZuWKDcvjMXz6zGaZj2pqfKzQv2eM6ArxDhaZgAY
[2020-01-09T11:57:19.621140513Z WARN  solana_vote_program::vote_state] 47UuTGPAQZX2HnVcfxKk8b1BtA4rRTduVaHnvxzQe6AJ dropped vote Vote { slots: [320799, 320800], hash: CJCVH3mtJcQGJfJJdmxV7njrPHC3eyHCtkpj5QLnNMPk, timestamp: None } failed to match hash CJCVH3mtJcQGJfJJdmxV7njrPHC3eyHCtkpj5QLnNMPk EaCPHBPVBwwVPFJaSJqnQjUhpA4r6ztVZ3Ex6TLNXrGP
[2020-01-09T11:57:19.883711035Z WARN  solana_vote_program::vote_state] 4Bx5bzjmPrU1g74AHfYpTMXvspBt8GnvZVQW3ba9z4Af dropped vote Vote { slots: [320800], hash: 35bSqxZv6SVXBJfpPVaVDBTv9YhuNziHggMUpQWJPoaM, timestamp: None } failed to match hash 35bSqxZv6SVXBJfpPVaVDBTv9YhuNziHggMUpQWJPoaM EaCPHBPVBwwVPFJaSJqnQjUhpA4r6ztVZ3Ex6TLNXrGP
[2020-01-09T11:57:19.883931481Z WARN  solana_vote_program::vote_state] 3Z5XVczCTXeYeFABoeFm1LngC9657kZMVGNFzqFXviHb dropped vote Vote { slots: [320796, 320797], hash: Dt37WA3zFpxsDihHQG1QWA5xsLuRBrAQ27duYTiqUP3t, timestamp: None } failed to match hash Dt37WA3zFpxsDihHQG1QWA5xsLuRBrAQ27duYTiqUP3t GXMfBvVt879CcVcDzyFL9fG6RmnoRCX2pypp8BBiRPqX
[2020-01-09T11:57:20.467817287Z WARN  solana_vote_program::vote_state] F7FgS6rrWckgC5X4cP5WtRRp3U1u12nnuTRXbWYaKn1u dropped vote Vote { slots: [320796, 320797], hash: B3gRZSNcD6g4fGzUQkQfwVhMAiVMBGaVGXQcdy9ZV13o, timestamp: None } failed to match hash B3gRZSNcD6g4fGzUQkQfwVhMAiVMBGaVGXQcdy9ZV13o GXMfBvVt879CcVcDzyFL9fG6RmnoRCX2pypp8BBiRPqX
[2020-01-09T11:57:20.470123309Z WARN  solana_vote_program::vote_state] 3Z5XVczCTXeYeFABoeFm1LngC9657kZMVGNFzqFXviHb dropped vote Vote { slots: [320796, 320797, 320798], hash: Awpx77TnV2oTi5vxgaEyvbMQhPeKzssoAUTsrrhXP8dc, timestamp: None } failed to match hash Awpx77TnV2oTi5vxgaEyvbMQhPeKzssoAUTsrrhXP8dc EeiQ4ccromw2jiCZUsxRhbMoXwnrZrX8nVpt82quLpzq

@sakridge
Copy link
Member

@sakridge Thanks for checking out! So, it seems that these log patterns are harmless.

Another suspicious error is:

log from 4Bx5bzjmPrU1g74AHfYpTMXvspBt8GnvZVQW3ba9z4Af:
[2020-01-09T11:57:09.623904040Z INFO  solana_core::repair_service] 4Bx5bzjmPrU1g74AHfYpTMXvspBt8GnvZVQW3ba9z4Af repair req send_to(XXX.XXX.XXX.XXX:XXX) error Os { code: 11, kind: WouldBlock, message: "Resource temporarily unavailable" }
[2020-01-09T11:57:09.623915739Z INFO  solana_core::repair_service] 4Bx5bzjmPrU1g74AHfYpTMXvspBt8GnvZVQW3ba9z4Af repair req send_to(XXX.XXX.XXX.XXX:XXX) error Os { code: 11, kind: WouldBlock, message: "Resource temporarily unavailable" }
[2020-01-09T11:57:09.623924490Z INFO  solana_core::repair_service] 4Bx5bzjmPrU1g74AHfYpTMXvspBt8GnvZVQW3ba9z4Af repair req send_to(XXX.XXX.XXX.XXX:XXX) error Os { code: 11, kind: WouldBlock, message: "Resource temporarily unavailable" }

This occurs a lot only on 4Bx5bzjmPrU1g74AHfYpTMXvspBt8GnvZVQW3ba9z4Af; and it seems there is no corresponding error at the other side validator. Does the requesting validator can really correctly handle error?

It just means when he sent the repair response that some kind of error was encountered when trying to send a packet. I'm not really sure exactly what can cause that, maybe a full udp buffer or some kind of network driver/kernel issue. But the requesting node would ask other nodes to also send it to him at some point, so it shouldn't be fatal.

Maybe if every node on the network somehow got into this state where they could not send anything, but that doesn't seem to be the case.

I would like to understand better about what can cause this.

@ryoqun
Copy link
Contributor

ryoqun commented Jan 14, 2020

Admittedly, this was hard to repro.

However, finally and finally it seems that I could manged to repro this locally.... (stay tuned)

[2020-01-14T09:37:40.088743523Z INFO  solana_core::replay_stage] new fork:1766 parent:1765 root:1734
thread '<unnamed>' panicked at 'ADvf3zXD87FfKV29mFCvnXiB7qMQRcQmQCET2qh4Vh74 dropped vote Vote { slots: [1765], hash: pRJynJGswe7gf2VryGfBFNVQvwgppMi3kCCqEqff2DQ, timestamp: None } failed to match hash pRJynJGswe7gf2VryGfBFNVQvwgppMi3kCCqEqff2DQ ZVuqHYbaMbbC3KAggriST6kyPPqQstT2dsAR9ZGScdM', programs/vote/src/vote_state.rs:237:13
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace.

@ryoqun
Copy link
Contributor

ryoqun commented Jan 16, 2020

@sakridge

Another suspicious error is:

[2020-01-09T11:57:09.623904040Z INFO  solana_core::repair_service] 4Bx5bzjmPrU1g74AHfYpTMXvspBt8GnvZVQW3ba9z4Af repair req send_to(XXX.XXX.XXX.XXX:XXX) error Os { code: 11, kind: WouldBlock, message: "Resource temporarily unavailable" }

I would like to understand better about what can cause this.

Your guess for the cause is correct; I elaborated on this a bit here: #7840 :)

@ryoqun
Copy link
Contributor

ryoqun commented Jan 20, 2020

Here is the long-waited test steps to reproduce this scary bank hash mismatch bug. :)

Basic idea is that we must focus to stress-test the banking code as much as possible while excluding other subsystems. To that purpose, I was forced to disable any of signature generation and verification and to increase the banking threads to somewhat insane number. As this bug was discovered in this way, I think it's generally worthwhile to maintain these development flags.

I know beefy machines with CUDA might be suffice but debugging without the lldb-unfriendly release build with a plain laptop without CUDA is just too handy to miss and attractive for me. :) I'll bet this will be useful for spotting race conditions in the future.

Steps To Reproduce:

  1. Apply not-merged PRs: Introduce --dummy-sig to solana-bench-tps #6575 (for --dummy-sig) and Add --dev-no-sigsign to disable signing shreds #7876 (for --dev-no-sigsign)
  2. Create a ledger just with 2 equally-staked validators
  3. Run solana-validators each with SOLANA_BANKING_THREADS=1000, --dev-no-sigveriry and --dev-no-sigsign
  4. Run solana-bench-tps with --dummy-sig and --duration=120
  5. Result:
    1. (fail; without Fix cluster collapse due to no proper shifted read #7797). Observe the bank hash mismatch error and the cluster died.
    2. (pass; with Fix cluster collapse due to no proper shifted read #7797). The cluster goes to progress normally after the overloaded TXes are all processed.

Once I manged to reproduce this bug locally, I could narrowed the bug down step by step, ultimately resulting in this PR (#7797). Without these PRs, it's really hard to pin down the suspicious subsystem unless we know it's the AccountsDB. ;)

@mvines @danpaul000 Once merged, I think we should also add a test scenario like this STR as one of our daily bench/sanity tests. So that we should be able to notice any race-condition regressions early, which tend to be very hard to debug. Any inputs regarding this?

FYI: @carllin (as I was questioned for the STR; here is the meat!)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants