DNM Add metrics about the last time a region is sent a snapshot #9429

CalvinNeo · 2024-09-13T14:54:11Z

What problem does this PR solve?

Issue Number: ref #9241

Problem Summary:

The idea is that when a region gets stuck, the TiKV leader may send it a snapshot. We record how much time has passed since the last snapshot. If that interval is short, the region may have a problem.

What is changed and how it works?

Check List

Tests

Unit test
Integration test
Manual test (add detailed scripts or steps below)
No code

Side effects

Performance regression: Consumes more CPU
Performance regression: Consumes more Memory
Breaking backward compatibility

Documentation

Release note

None

Signed-off-by: Calvin Neo <[email protected]>

ti-chi-bot · 2024-09-13T14:54:15Z

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please ask for approval from calvinneo, ensuring that each of them provides their approval before proceeding. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

OWNERS

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

CalvinNeo · 2024-09-13T14:55:32Z

It is hard to reproduce a snapshot being sent to a pending peer.

Also:
Apply snapshot time analysis

Signed-off-by: Calvin Neo <[email protected]>

JaySon-Huang · 2024-09-14T01:55:40Z

dbms/src/Storages/KVStore/MultiRaft/ApplySnapshot.cpp

@@ -351,7 +351,7 @@ void KVStore::onSnapshot(

        tmt.getRegionTable().shrinkRegionRange(*new_region);
    }
-
+    new_region->updateSnapshotAppliedTime();


We always create a new_region instance to apply a snapshot instead of modifying the old_region in place. So I think the metric tiflash_raft_long_term_event_duration_seconds, type_apply_snapshot_gap is never reported at all?

Yes. I tried to fix this by fetching lastSNapshotAppliedTime() from the old_region (if any) and computing the gap from it. However, the metric is still hard to observe in real tests, because the only reliable way to make TiFlash lag is to kill it, and once TiFlash restarts, lastSNapshotAppliedTime() is reset to 0.
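
For readers following the thread, here is a minimal sketch of the approach described in this reply: read the previous snapshot time from the old region, compute the gap, then stamp the new region. It is not the code from this PR. RegionSketch, snapshotGapSeconds and onSnapshotApplied are hypothetical names, and the millisecond timestamp with 0 meaning "unknown" is an assumption based on the restart behaviour mentioned above.

```cpp
#include <cassert>
#include <cstdint>
#include <optional>

// Hypothetical stand-in for a region; only the field relevant here is modeled.
struct RegionSketch
{
    // Time (ms) at which the last snapshot was applied.
    // 0 is assumed to mean "unknown", e.g. right after a restart.
    uint64_t last_snapshot_applied_ms = 0;
};

// Returns the gap in seconds between the previous snapshot (recorded on the
// old region) and now_ms, or std::nullopt when no gap can be computed.
inline std::optional<double> snapshotGapSeconds(const RegionSketch * old_region, uint64_t now_ms)
{
    if (old_region == nullptr)
        return std::nullopt; // first snapshot seen for this region
    const uint64_t last_ms = old_region->last_snapshot_applied_ms;
    if (last_ms == 0 || now_ms < last_ms)
        return std::nullopt; // timestamp lost (restart) or clock went backwards
    return static_cast<double>(now_ms - last_ms) / 1000.0;
}

// Computes the gap from the old region first, then stamps the freshly created
// region, because the new instance starts without any snapshot history.
inline std::optional<double> onSnapshotApplied(
    const RegionSketch * old_region,
    RegionSketch & new_region,
    uint64_t now_ms)
{
    assert(now_ms != 0); // 0 is reserved for "unknown" in this sketch
    const std::optional<double> gap = snapshotGapSeconds(old_region, now_ms);
    new_region.last_snapshot_applied_ms = now_ms;
    return gap;
}
```

A caller would report the returned value to the metric only when it is present. The empty result after a restart is the same limitation noted above: the in-memory timestamp does not survive the process being killed.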


JinheLin · 2024-09-19T05:58:01Z

dbms/src/Common/Stopwatch.h

@@ -69,6 +69,8 @@ class Stopwatch
        is_running = true;
    }

+    UInt64 getStartMillis() { return start_ns / 1000000UL; }


Abbreviating milliseconds to millis seems a bit strange.
getStartMS may be clearer.

I would just use Milliseconds, because other methods use it too...

JaySon-Huang · 2024-09-19T07:26:03Z

dbms/src/Storages/KVStore/MultiRaft/Disagg/CheckpointIngestInfo.h

@@ -42,6 +42,7 @@ struct CheckpointIngestInfo
    UInt64 regionId() const { return region_id; }
    UInt64 peerId() const { return peer_id; }
    UInt64 beginTime() const { return begin_time; }
+    UInt64 createdTime() const { return begin_time; }


Should this be return created_time?

Also, please add comments explaining the difference between beginTime and createdTime.

Changed this PR back to DNM, because some new functionality has been added to it.
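
To illustrate the point of this review comment, here is a small sketch in which each accessor returns its own field. CheckpointIngestInfoSketch and checkTimeAccessors are hypothetical names, the created_time member is assumed from the question above, and the meanings given in the comments are guesses, since the thread does not define them.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical, trimmed-down stand-in for CheckpointIngestInfo.
class CheckpointIngestInfoSketch
{
public:
    CheckpointIngestInfoSketch(uint64_t region_id_, uint64_t peer_id_, uint64_t begin_time_, uint64_t created_time_)
        : region_id(region_id_)
        , peer_id(peer_id_)
        , begin_time(begin_time_)
        , created_time(created_time_)
    {}

    uint64_t regionId() const { return region_id; }
    uint64_t peerId() const { return peer_id; }
    // Assumed meaning: when the ingest started.
    uint64_t beginTime() const { return begin_time; }
    // Assumed meaning: when this info object was created.
    // Returns its own field rather than begin_time.
    uint64_t createdTime() const { return created_time; }

private:
    uint64_t region_id;
    uint64_t peer_id;
    uint64_t begin_time;
    uint64_t created_time;
};

// A tiny check with distinct values for the two times; it would fail if
// createdTime() returned begin_time by mistake.
inline void checkTimeAccessors()
{
    const CheckpointIngestInfoSketch info(1, 2, 100, 250);
    assert(info.beginTime() == 100);
    assert(info.createdTime() == 250);
}
```

Using different values for the two timestamps in a unit test is a cheap way to catch this kind of copy-paste slip.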

CalvinNeo · 2024-09-20T07:26:33Z

After #9434

Signed-off-by: Calvin Neo <[email protected]>

ti-chi-bot · 2024-10-15T08:43:08Z

@CalvinNeo: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name	Commit	Details	Required	Rerun command
pull-unit-test	`0b438bc`	link	true	`/test pull-unit-test`
pull-integration-test	`0b438bc`	link	true	`/test pull-integration-test`

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

ti-chi-bot · 2024-10-18T10:03:25Z

PR needs rebase.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

a

3c9da40

Signed-off-by: Calvin Neo <[email protected]>

ti-chi-bot bot added the release-note-none Denotes a PR that doesn't merit a release note. label Sep 13, 2024

ti-chi-bot bot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Sep 13, 2024

CalvinNeo requested review from JaySon-Huang, Lloyd-Pottiger and JinheLin September 13, 2024 15:19

bucket

a08cfd8

Signed-off-by: Calvin Neo <[email protected]>

JaySon-Huang reviewed Sep 14, 2024

fix format

75de7f5

Signed-off-by: Calvin Neo <[email protected]>

JinheLin reviewed Sep 14, 2024

dbms/src/Storages/KVStore/Region.cpp (outdated, resolved)

CalvinNeo and others added 4 commits September 14, 2024 11:13

Update dbms/src/Storages/KVStore/Region.cpp

88bfd75

Co-authored-by: jinhelin <[email protected]>

fix

7c37204

Signed-off-by: Calvin Neo <[email protected]>

Merge branch 'add-last-snapshot' of github.com:CalvinNeo/tics into ad…

edf4283

…d-last-snapshot

fix

1f56832

Signed-off-by: Calvin Neo <[email protected]>

ti-chi-bot bot added size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. and removed size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Sep 14, 2024

CalvinNeo added 2 commits September 18, 2024 23:36

fix

98a5f40

Signed-off-by: Calvin Neo <[email protected]>

f

95e8f38

Signed-off-by: Calvin Neo <[email protected]>

JinheLin reviewed Sep 19, 2024

JaySon-Huang reviewed Sep 19, 2024

CalvinNeo changed the title ~~Add metrics about the last time a region is sent a snapshot~~ DNM Add metrics about the last time a region is sent a snapshot Sep 20, 2024

enhance some comments

0b438bc

Signed-off-by: Calvin Neo <[email protected]>

ti-chi-bot bot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Oct 18, 2024
