mining instability issue tracker #11251

Open · 1 of 5 tasks
jennijuju opened this issue Sep 6, 2023 · 4 comments
Labels: P1 (Must be resolved)

@jennijuju (Member) commented Sep 6, 2023

Note: Currently this happens only occasionally, and we are not yet seeing a consistent network-wide issue. The team is monitoring closely. If you are constantly impacted by this issue, please let us know in #fil-lotus-help. For those who observe an increase in block generation duration, I'd recommend lowering PROPAGATION_DELAY_SECS to 6-8 sec.
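PROPAGATION_DELAY_SECS is read from the environment when the daemon starts, so lowering it typically amounts to exporting the variable before launching `lotus daemon`. As a rough illustration of how such an override can be parsed (a minimal sketch only, not Lotus's actual implementation; the 10s fallback below is an assumption, check your Lotus version for the real default):

```go
package main

import (
	"fmt"
	"os"
	"strconv"
	"time"
)

// defaultPropagationDelay is an assumed fallback for illustration only.
const defaultPropagationDelay = 10 * time.Second

// propagationDelay honors the PROPAGATION_DELAY_SECS environment variable
// when it is set to a positive integer number of seconds.
func propagationDelay() time.Duration {
	if v := os.Getenv("PROPAGATION_DELAY_SECS"); v != "" {
		if secs, err := strconv.Atoi(v); err == nil && secs > 0 {
			return time.Duration(secs) * time.Second
		}
	}
	return defaultPropagationDelay
}

func main() {
	fmt.Println("using propagation delay:", propagationDelay())
}
```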

On Aug 28th, @beck-8 reported that the network is less stable, with small forks, and that storage providers are observing delays in block generation and an increase in missed winning blocks.

Unfortunately, we have not had a chance to get to #10888 yet (other than the message execution profiling work in #10892), which aims to develop better monitoring, tooling, and metrics to help mining node operators collect more detail and gain more visibility into mining execution when an issue occurs. So, while we have some guesses at what the cause could be, we are not certain yet.

We are creating this issue to track the investigation and development effort. @fridrik01 will add more detailed plans/updates later. For now, the info we have is:

  • It seems like MpoolSelect takes longer when the mpool is packed, which impacts WinningPoSt. @rjan90 suspected that we might have some high-contention/waiting locks in the mpool logic (a minimal sketch for timing MpoolSelect on your own node follows this list).
  • @rjan90 was able to reproduce the issue with lotus-shed mpool miner-select-messages, see findings here.
  • @marco-storswift is also observing that some tPendings can take >30s.
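For anyone who wants to quantify the delay on their own node, here is a minimal sketch that times a single MpoolSelect call over the JSON-RPC API. It assumes the v1 client constructor `client.NewFullNodeRPCV1` and the `MpoolSelect(ctx, tipSetKey, ticketQuality)` signature; `LOTUS_API_ADDR` and `LOTUS_API_TOKEN` are hypothetical environment variable names for the node endpoint and auth token:

```go
package main

import (
	"context"
	"log"
	"net/http"
	"os"
	"time"

	"github.com/filecoin-project/lotus/api/client"
)

func main() {
	ctx := context.Background()

	addr := os.Getenv("LOTUS_API_ADDR")   // e.g. "ws://127.0.0.1:1234/rpc/v1" (assumed env var name)
	token := os.Getenv("LOTUS_API_TOKEN") // node API token (assumed env var name)

	headers := http.Header{"Authorization": []string{"Bearer " + token}}
	node, closer, err := client.NewFullNodeRPCV1(ctx, addr, headers)
	if err != nil {
		log.Fatalf("connecting to node: %v", err)
	}
	defer closer()

	head, err := node.ChainHead(ctx)
	if err != nil {
		log.Fatalf("ChainHead: %v", err)
	}

	// Time one selection pass against the current head with ticket quality 1.0.
	start := time.Now()
	msgs, err := node.MpoolSelect(ctx, head.Key(), 1.0)
	if err != nil {
		log.Fatalf("MpoolSelect: %v", err)
	}
	log.Printf("MpoolSelect returned %d messages in %s", len(msgs), time.Since(start))
}
```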

Potential Todos:

@jennijuju added the P1 (Must be resolved) label on Sep 6, 2023
@fridrik01 (Contributor) commented:

TL;DR: My current theory is that this is related to the splitstore pruning logic, which seems to correlate exactly with the high latency of MpoolSelect.

Here are updates after the past 2 days of debugging this issue:

  1. I set up Grafana/Prometheus, installed node_exporter, and used it to collect system metrics.
  2. I added more logging to the TipSetExecutor::ApplyBlocks method to see where time was spent inside that method.
  3. I created a new simple benchmark tool for calling binaries and analyzing their latency distribution (similar to lotus-bench rpc, except for binaries); the core idea is sketched after this list.
  4. I then used this tool to call miner-select-messages sequentially overnight (lotus-bench cli --cmd=lotus-shed mpool miner-select-messages --concurrency=1 --qps=1 --watch=5s --duration=24h).
  5. After running this overnight, I observed 2 cases where this command took 15+ seconds (see image).
  6. Looking at the Grafana dashboards, I saw two large disk write spikes (500MB/s) and checked the Lotus logs showing the additional logging from step 2.
  7. In both cases, these large disk write spikes happened when we had the 15+ sec latencies.
  8. Looking through the logs, I saw that during both spikes the splitstore was doing compaction.
  9. I then observed that this compaction happens approximately every 7h30m, and each time we observed high latency spikes.
  10. The purging step of the compaction logic holds the `txnLk` lock for writing, potentially blocking any read access to blocks.
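The core of the benchmark driver from steps 3-4 can be sketched as a loop that shells out to the command at roughly 1 QPS and reports latency percentiles. This is illustrative only; the real lotus-bench cli run used more options (concurrency, watch interval, duration):

```go
package main

import (
	"fmt"
	"os/exec"
	"sort"
	"time"
)

func main() {
	const runs = 60 // keep small here; the real run went overnight
	cmd := []string{"lotus-shed", "mpool", "miner-select-messages"}

	latencies := make([]time.Duration, 0, runs)
	for i := 0; i < runs; i++ {
		start := time.Now()
		if err := exec.Command(cmd[0], cmd[1:]...).Run(); err != nil {
			fmt.Println("run failed:", err)
			continue
		}
		latencies = append(latencies, time.Since(start))
		time.Sleep(time.Second) // roughly 1 QPS
	}

	if len(latencies) == 0 {
		return
	}
	sort.Slice(latencies, func(i, j int) bool { return latencies[i] < latencies[j] })
	p := func(q float64) time.Duration { return latencies[int(float64(len(latencies)-1)*q)] }
	fmt.Printf("p50=%s p90=%s p99=%s max=%s\n", p(0.5), p(0.9), p(0.99), latencies[len(latencies)-1])
}
```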

Also collecting notes in: https://docs.google.com/document/d/1ZxTJ5mQi6tezGlF6FFWAZRTCOf2PsinmjPZ9INnaUe0/edit

This is where I am now.

My theory is that vmi.ApplyMessage (called inside ApplyBlocks, which runs the FVM) is in turn reading from the blockstore, and those reads wait on a lock held by the pruning logic of the splitstore.
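The suspected interaction is a classic RWMutex pattern: a purge that holds the write lock for its whole duration stalls every concurrent read for that same duration. A toy, self-contained illustration of the pattern (not the actual splitstore code):

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// store mimics the suspected pattern only: reads take a read lock on txnLk,
// while a long purge holds the write lock, stalling every concurrent read.
type store struct {
	txnLk sync.RWMutex
	data  map[string][]byte
}

func (s *store) Get(key string) []byte {
	s.txnLk.RLock() // blocks while purge holds the write lock
	defer s.txnLk.RUnlock()
	return s.data[key]
}

func (s *store) purge() {
	s.txnLk.Lock() // held for the entire purge
	defer s.txnLk.Unlock()
	time.Sleep(2 * time.Second) // stand-in for deleting many cold blocks
}

func main() {
	s := &store{data: map[string][]byte{"k": []byte("v")}}

	go s.purge()
	time.Sleep(100 * time.Millisecond) // let the purge grab the lock first

	start := time.Now()
	s.Get("k")
	fmt.Println("read stalled for", time.Since(start)) // ~2s, matching the observed spikes
}
```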

@beck-8 (Contributor) commented Sep 14, 2023

I think this is just one situation; there are several types:

  1. The problem you mentioned, caused by the splitstore (I agree with this. I have found before that this feature often causes synchronization problems, so I do not use it in production).
  2. Too many messages in the message pool, causing MpoolSelect to be slow.

@fridrik01 (Contributor) commented:

Update:

  • Uploaded a PR which fixes the high latency when splitstore is enabled: fix: Add time slicing to splitstore purging to reduce lock congestion #11269 (the batching idea is sketched after this list).
  • @beck-8 is not using splitstore, so there seems to be another underlying issue, probably correlated with the size of the message pool as mentioned.
  • Added more metrics, and now log goroutines that wait a long time for I/O from the badger blockstore.
  • I ran a Lotus node over the weekend with splitstore disabled, and the slowest mpool select (and tipset execution) I saw was 1.7 sec (see more details in this doc), but I did not see a large mpool size over the weekend.
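The idea behind the time-slicing fix, in rough outline (a sketch only, not the code in the PR): do the purge in small batches and release the write lock between batches so pending reads can make progress.

```go
package main

import (
	"sync"
	"time"
)

// slicedPurger illustrates the batching idea only; the real splitstore
// tracks blocks, not string keys.
type slicedPurger struct {
	txnLk sync.RWMutex
	keys  []string
	data  map[string][]byte
}

// purge deletes keys in batches, dropping the write lock between batches so
// readers waiting on txnLk get a chance to run.
func (p *slicedPurger) purge(batchSize int, pause time.Duration) {
	for len(p.keys) > 0 {
		n := batchSize
		if n > len(p.keys) {
			n = len(p.keys)
		}

		p.txnLk.Lock()
		for _, k := range p.keys[:n] {
			delete(p.data, k) // stand-in for deleting blocks from the store
		}
		p.txnLk.Unlock()

		p.keys = p.keys[n:]
		time.Sleep(pause) // yield to pending readers
	}
}

func main() {
	p := &slicedPurger{
		keys: []string{"a", "b", "c", "d"},
		data: map[string][]byte{"a": nil, "b": nil, "c": nil, "d": nil},
	}
	p.purge(2, 10*time.Millisecond)
}
```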

@fridrik01 (Contributor) commented:

Update:

  • I have been running the Lotus node described in my previous update for over a week now and have not noticed any stability issues whatsoever.
  • I reached out to @beck-8 and he has not noticed any issues either.
  • The mpool size has been low throughout this time.
  • I will continue to run my node in the background while working on other things, but we should consider adding a benchmark to Lotus that tests mpool functionality under load (opened Add benchmark for testing mpool functionality under load #11304); a toy sketch of such a benchmark follows.
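As a starting point for such a benchmark, here is a toy sketch using Go's testing.B. The message type and selection function below are stand-ins chosen for illustration, not the real messagepool API:

```go
package mpoolbench

import (
	"sort"
	"testing"
)

// message is a stand-in: only the gas premium matters for this toy selection.
type message struct{ gasPremium int64 }

// selectMessages mimics "pick the most profitable messages first" by sorting
// a copy of the pool by gas premium, descending.
func selectMessages(pool []message) []message {
	out := make([]message, len(pool))
	copy(out, pool)
	sort.Slice(out, func(i, j int) bool { return out[i].gasPremium > out[j].gasPremium })
	return out
}

func BenchmarkSelectUnderLoad(b *testing.B) {
	// Simulate a packed pool; increase this to stress selection further.
	pool := make([]message, 100_000)
	for i := range pool {
		pool[i] = message{gasPremium: int64(i % 997)}
	}

	b.ResetTimer()
	for i := 0; i < b.N; i++ {
		selectMessages(pool)
	}
}
```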
