mining instability issue tracker #11251

Open · 1 of 5 tasks
jennijuju opened this issue Sep 6, 2023 · 4 comments
Labels: P1 (Must be resolved)

@jennijuju (Member) commented Sep 6, 2023

Note: Currently this happens only occasionally, and we are not yet seeing a consistent network-wide issue. The team is monitoring closely. If you are constantly impacted by this issue, please let us know in #fil-lotus-help. For those who observe an increase in block generation duration, I'd recommend lowering PROPAGATION_DELAY_SECS to 6-8 sec.
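PROPAGATION_DELAY_SECS is read from the environment when the daemon starts, so lowering it typically amounts to exporting the variable before launching `lotus daemon`. As a rough illustration of how such an override can be parsed (a minimal sketch only, not Lotus's actual implementation; the 10s fallback below is an assumption, check your Lotus version for the real default):

```go
package main

import (
	"fmt"
	"os"
	"strconv"
	"time"
)

// defaultPropagationDelay is an assumed fallback for illustration only.
const defaultPropagationDelay = 10 * time.Second

// propagationDelay honors the PROPAGATION_DELAY_SECS environment variable
// when it is set to a positive integer number of seconds.
func propagationDelay() time.Duration {
	if v := os.Getenv("PROPAGATION_DELAY_SECS"); v != "" {
		if secs, err := strconv.Atoi(v); err == nil && secs > 0 {
			return time.Duration(secs) * time.Second
		}
	}
	return defaultPropagationDelay
}

func main() {
	fmt.Println("using propagation delay:", propagationDelay())
}
```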

On Aug 28th, @beck-8 reported that the network is less stable, with small forks, and that storage providers are observing delays in block generation and an increase in missed winning blocks.

Unfortunately, we have not had a chance to get to #10888 yet (other than the message execution profiling work in #10892), which aims to develop better monitoring, tooling, and metrics to help mining node operators collect more detail and gain more visibility into mining execution when an issue occurs. So, while we have some guesses at what the cause could be, we are not certain yet.

We are creating this issue to track the investigation and development effort. @fridrik01 will add more detailed plans/updates later. For now, the info we have is:

  • It seems like MpoolSelect takes longer when the mpool is packed, which impacts WinningPoSt. @rjan90 suspected that we might have some high-contention/waiting locks in the mpool logic (a minimal sketch for timing MpoolSelect on your own node follows this list).
  • @rjan90 was able to reproduce the issue with lotus-shed mpool miner-select-messages, see findings here.
  • @marco-storswift is also observing that some tPendings can take >30s.
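For anyone who wants to quantify the delay on their own node, here is a minimal sketch that times a single MpoolSelect call over the JSON-RPC API. It assumes the v1 client constructor `client.NewFullNodeRPCV1` and the `MpoolSelect(ctx, tipSetKey, ticketQuality)` signature; `LOTUS_API_ADDR` and `LOTUS_API_TOKEN` are hypothetical environment variable names for the node endpoint and auth token:

```go
package main

import (
	"context"
	"log"
	"net/http"
	"os"
	"time"

	"github.com/filecoin-project/lotus/api/client"
)

func main() {
	ctx := context.Background()

	addr := os.Getenv("LOTUS_API_ADDR")   // e.g. "ws://127.0.0.1:1234/rpc/v1" (assumed env var name)
	token := os.Getenv("LOTUS_API_TOKEN") // node API token (assumed env var name)

	headers := http.Header{"Authorization": []string{"Bearer " + token}}
	node, closer, err := client.NewFullNodeRPCV1(ctx, addr, headers)
	if err != nil {
		log.Fatalf("connecting to node: %v", err)
	}
	defer closer()

	head, err := node.ChainHead(ctx)
	if err != nil {
		log.Fatalf("ChainHead: %v", err)
	}

	// Time one selection pass against the current head with ticket quality 1.0.
	start := time.Now()
	msgs, err := node.MpoolSelect(ctx, head.Key(), 1.0)
	if err != nil {
		log.Fatalf("MpoolSelect: %v", err)
	}
	log.Printf("MpoolSelect returned %d messages in %s", len(msgs), time.Since(start))
}
```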

Potential Todos:

@jennijuju added the P1 (Must be resolved) label on Sep 6, 2023
@fridrik01 (Contributor) commented:

TL;DR: My current theory is that this is related to the splitstore pruning logic, which seems to correlate exactly with the high latency of MpoolSelect.

Here are updates after the past 2 days of debugging this issue:

  1. I set up Grafana/Prometheus, installed node_exporter, and used it to collect system metrics.
  2. I added more logging to the TipSetExecutor::ApplyBlocks method to see where time was spent inside that method.
  3. I created a new simple benchmark tool for calling binaries and analyzing their latency distribution (similar to lotus-bench rpc, except for binaries); the core idea is sketched after this list.
  4. I then used this tool to call miner-select-messages sequentially overnight (lotus-bench cli --cmd=lotus-shed mpool miner-select-messages --concurrency=1 --qps=1 --watch=5s --duration=24h).
  5. After running this overnight, I observed 2 cases where this command took 15+ seconds (see image).
  6. Looking at the Grafana dashboards, I saw two large disk write spikes (500MB/s) and checked the Lotus logs showing the additional logging from step 2.
  7. In both cases, these large disk write spikes happened when we had the 15+ sec latencies.
  8. Looking through the logs, I saw that during both spikes the splitstore was doing compaction.
  9. I then observed that this compaction happens approximately every 7h30m, and each time we observed high latency spikes.
  10. The purging step of the compaction logic holds the `txnLk` lock for writing, potentially blocking any read access to blocks.
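The core of the benchmark driver from steps 3-4 can be sketched as a loop that shells out to the command at roughly 1 QPS and reports latency percentiles. This is illustrative only; the real lotus-bench cli run used more options (concurrency, watch interval, duration):

```go
package main

import (
	"fmt"
	"os/exec"
	"sort"
	"time"
)

func main() {
	const runs = 60 // keep small here; the real run went overnight
	cmd := []string{"lotus-shed", "mpool", "miner-select-messages"}

	latencies := make([]time.Duration, 0, runs)
	for i := 0; i < runs; i++ {
		start := time.Now()
		if err := exec.Command(cmd[0], cmd[1:]...).Run(); err != nil {
			fmt.Println("run failed:", err)
			continue
		}
		latencies = append(latencies, time.Since(start))
		time.Sleep(time.Second) // roughly 1 QPS
	}

	if len(latencies) == 0 {
		return
	}
	sort.Slice(latencies, func(i, j int) bool { return latencies[i] < latencies[j] })
	p := func(q float64) time.Duration { return latencies[int(float64(len(latencies)-1)*q)] }
	fmt.Printf("p50=%s p90=%s p99=%s max=%s\n", p(0.5), p(0.9), p(0.99), latencies[len(latencies)-1])
}
```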

Also collecting notes in: https://docs.google.com/document/d/1ZxTJ5mQi6tezGlF6FFWAZRTCOf2PsinmjPZ9INnaUe0/edit

This is where I am now.

My theory is that vmi.ApplyMessage (called inside ApplyBlocks, which runs the FVM) is in turn reading from the blockstore, and those reads wait on a lock held by the pruning logic of the splitstore.
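The suspected interaction is a classic RWMutex pattern: a purge that holds the write lock for its whole duration stalls every concurrent read for that same duration. A toy, self-contained illustration of the pattern (not the actual splitstore code):

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// store mimics the suspected pattern only: reads take a read lock on txnLk,
// while a long purge holds the write lock, stalling every concurrent read.
type store struct {
	txnLk sync.RWMutex
	data  map[string][]byte
}

func (s *store) Get(key string) []byte {
	s.txnLk.RLock() // blocks while purge holds the write lock
	defer s.txnLk.RUnlock()
	return s.data[key]
}

func (s *store) purge() {
	s.txnLk.Lock() // held for the entire purge
	defer s.txnLk.Unlock()
	time.Sleep(2 * time.Second) // stand-in for deleting many cold blocks
}

func main() {
	s := &store{data: map[string][]byte{"k": []byte("v")}}

	go s.purge()
	time.Sleep(100 * time.Millisecond) // let the purge grab the lock first

	start := time.Now()
	s.Get("k")
	fmt.Println("read stalled for", time.Since(start)) // ~2s, matching the observed spikes
}
```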

@beck-8 (Contributor) commented Sep 14, 2023

I think this is just one situation; there are several types:

  1. The problem you mentioned, caused by the splitstore (I agree with this. I have found before that this feature often causes synchronization problems, so I do not use it in production).
  2. Too many messages in the message pool, causing MpoolSelect to be slow.

@fridrik01 (Contributor) commented:

Update:

  • Uploaded a PR which fixes the high latency when splitstore is enabled: fix: Add time slicing to splitstore purging to reduce lock congestion #11269 (the batching idea is sketched after this list).
  • @beck-8 is not using splitstore, so there seems to be another underlying issue, probably correlated with the size of the message pool as mentioned.
  • Added more metrics, and now log goroutines that wait a long time for I/O from the badger blockstore.
  • I ran a Lotus node over the weekend with splitstore disabled, and the slowest mpool select (and tipset execution) I saw was 1.7 sec (see more details in this doc), but I did not see a large mpool size over the weekend.
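The idea behind the time-slicing fix, in rough outline (a sketch only, not the code in the PR): do the purge in small batches and release the write lock between batches so pending reads can make progress.

```go
package main

import (
	"sync"
	"time"
)

// slicedPurger illustrates the batching idea only; the real splitstore
// tracks blocks, not string keys.
type slicedPurger struct {
	txnLk sync.RWMutex
	keys  []string
	data  map[string][]byte
}

// purge deletes keys in batches, dropping the write lock between batches so
// readers waiting on txnLk get a chance to run.
func (p *slicedPurger) purge(batchSize int, pause time.Duration) {
	for len(p.keys) > 0 {
		n := batchSize
		if n > len(p.keys) {
			n = len(p.keys)
		}

		p.txnLk.Lock()
		for _, k := range p.keys[:n] {
			delete(p.data, k) // stand-in for deleting blocks from the store
		}
		p.txnLk.Unlock()

		p.keys = p.keys[n:]
		time.Sleep(pause) // yield to pending readers
	}
}

func main() {
	p := &slicedPurger{
		keys: []string{"a", "b", "c", "d"},
		data: map[string][]byte{"a": nil, "b": nil, "c": nil, "d": nil},
	}
	p.purge(2, 10*time.Millisecond)
}
```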

@fridrik01 (Contributor) commented:

Update:

  • I have been running the Lotus node described in my previous update for over a week now and have not noticed any stability issues whatsoever.
  • I reached out to @beck-8 and he has not noticed any issues either.
  • The mpool size has been low throughout this time.
  • I will continue to run my node in the background while working on other things, but we should consider adding a benchmark to Lotus that tests mpool functionality under load (opened Add benchmark for testing mpool functionality under load #11304); a toy sketch of such a benchmark follows.
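As a starting point for such a benchmark, here is a toy sketch using Go's testing.B. The message type and selection function below are stand-ins chosen for illustration, not the real messagepool API:

```go
package mpoolbench

import (
	"sort"
	"testing"
)

// message is a stand-in: only the gas premium matters for this toy selection.
type message struct{ gasPremium int64 }

// selectMessages mimics "pick the most profitable messages first" by sorting
// a copy of the pool by gas premium, descending.
func selectMessages(pool []message) []message {
	out := make([]message, len(pool))
	copy(out, pool)
	sort.Slice(out, func(i, j int) bool { return out[i].gasPremium > out[j].gasPremium })
	return out
}

func BenchmarkSelectUnderLoad(b *testing.B) {
	// Simulate a packed pool; increase this to stress selection further.
	pool := make([]message, 100_000)
	for i := range pool {
		pool[i] = message{gasPremium: int64(i % 997)}
	}

	b.ResetTimer()
	for i := 0; i < b.N; i++ {
		selectMessages(pool)
	}
}
```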
