DNM Add metrics about the last time a region is sent a snapshot #9429

Open
wants to merge 10 commits into base: master

Conversation

CalvinNeo
Member

What problem does this PR solve?

Issue Number: ref #9241

Problem Summary:

The idea is that when a region gets stuck, the TiKV leader may send it a snapshot. We record how much time has passed since the region's last snapshot. If that interval is short, the region may have a problem.

What is changed and how it works?


Check List

Tests

  • Unit test
  • Integration test
  • Manual test (add detailed scripts or steps below)
  • No code

Side effects

  • Performance regression: Consumes more CPU
  • Performance regression: Consumes more Memory
  • Breaking backward compatibility

Documentation

  • Affects user behaviors
  • Contains syntax changes
  • Contains variable changes
  • Contains experimental features
  • Changes MySQL compatibility

Release note

None

a
Signed-off-by: Calvin Neo <[email protected]>
@ti-chi-bot ti-chi-bot bot added the release-note-none Denotes a PR that doesn't merit a release note. label Sep 13, 2024
Contributor

ti-chi-bot bot commented Sep 13, 2024

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please ask for approval from calvinneo, ensuring that each of them provides their approval before proceeding. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@ti-chi-bot ti-chi-bot bot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Sep 13, 2024
@CalvinNeo
Member Author

CalvinNeo commented Sep 13, 2024

It is hard to reproduce the case where a snapshot is applied on a pending peer.
image

Also:
Apply snapshot time analysis
image

Signed-off-by: Calvin Neo <[email protected]>
@@ -351,7 +351,7 @@ void KVStore::onSnapshot(

tmt.getRegionTable().shrinkRegionRange(*new_region);
}

new_region->updateSnapshotAppliedTime();
Contributor

We always create a new_region instance to apply a snapshot instead of modifying old_region in place. So I think tiflash_raft_long_term_event_duration_seconds, type_apply_snapshot_gap is never reported at all?

Member Author

@CalvinNeo CalvinNeo Sep 14, 2024

Yes, I have tried to fix this by fetching lastSNapshotAppliedTime() from the old_region (if any) to compute the gap. However, the metric is still hard to observe in real tests, because the only stable way to make TiFlash lag is to kill it, and once TiFlash restarts, lastSNapshotAppliedTime() is reset to 0.
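
For illustration only, here is a minimal sketch of the gap computation described above. It assumes the timestamp is kept in milliseconds and that 0 means "unknown" (for example after a restart). `RegionSketch` and `snapshotGapSeconds` are hypothetical names for this sketch, not identifiers from this PR.

```cpp
#include <cstdint>
#include <optional>

// Hypothetical stand-in for a region object; only the field relevant here.
struct RegionSketch
{
    // Assumed: millisecond timestamp of the last applied snapshot.
    // 0 means "unknown", e.g. the process restarted and the value was lost.
    uint64_t last_snapshot_applied_ms = 0;
};

// Returns the gap in seconds between the previous snapshot and the one being
// applied at `now_ms`, or std::nullopt when no usable previous timestamp exists.
inline std::optional<double> snapshotGapSeconds(const RegionSketch * old_region, uint64_t now_ms)
{
    if (old_region == nullptr)
        return std::nullopt; // no old region: first snapshot seen for this region
    const uint64_t last_ms = old_region->last_snapshot_applied_ms;
    if (last_ms == 0 || now_ms < last_ms)
        return std::nullopt; // unknown after restart, or the clock went backwards
    return static_cast<double>(now_ms - last_ms) / 1000.0;
}
```

Under these assumptions a caller would observe the metric only when a value is returned, so a restarted process reports nothing for the first snapshot of each region, which matches the difficulty described above.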

Signed-off-by: Calvin Neo <[email protected]>
@ti-chi-bot ti-chi-bot bot added size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. and removed size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Sep 14, 2024
Signed-off-by: Calvin Neo <[email protected]>
f
Signed-off-by: Calvin Neo <[email protected]>
@@ -69,6 +69,8 @@ class Stopwatch
is_running = true;
}

UInt64 getStartMillis() { return start_ns / 1000000UL; }
Contributor

Abbreviating milliseconds to millis seems a bit strange.
getStartMS may be clearer.

Member Author

I would just use Milliseconds because other methods use this too...
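
A small sketch of what the renamed accessor could look like, assuming the start time is stored in nanoseconds as in the hunk above. `StopwatchSketch`, `nanosToMilliseconds`, and `getStartMilliseconds` are hypothetical names used only for this illustration.

```cpp
#include <cstdint>

// Number of nanoseconds in one millisecond.
constexpr uint64_t NANOS_PER_MILLI = 1'000'000ULL;

// Integer division truncates, so sub-millisecond remainders are dropped.
constexpr uint64_t nanosToMilliseconds(uint64_t ns)
{
    return ns / NANOS_PER_MILLI;
}

// Hypothetical reduced stopwatch that only remembers when it was started.
class StopwatchSketch
{
public:
    explicit StopwatchSketch(uint64_t start_ns_)
        : start_ns(start_ns_)
    {}

    // Start time expressed in whole milliseconds.
    uint64_t getStartMilliseconds() const { return nanosToMilliseconds(start_ns); }

private:
    uint64_t start_ns;
};
```

Naming the divisor avoids a bare `1000000UL` literal and makes the truncating conversion explicit.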

@@ -42,6 +42,7 @@ struct CheckpointIngestInfo
UInt64 regionId() const { return region_id; }
UInt64 peerId() const { return peer_id; }
UInt64 beginTime() const { return begin_time; }
UInt64 createdTime() const { return begin_time; }
Contributor

@JaySon-Huang JaySon-Huang Sep 19, 2024

Should this return created_time?

Contributor

Also, please add comments explaining the difference between beginTime and createdTime.
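
To make the two suggestions concrete, a sketch of distinct, documented accessors might look like the following. It assumes a separate created_time member exists. The meanings given in the comments are guesses for illustration, and `CheckpointIngestInfoSketch` is a hypothetical name.

```cpp
#include <cstdint>

// Hypothetical reduced version of the struct, showing only the two timestamps.
struct CheckpointIngestInfoSketch
{
    CheckpointIngestInfoSketch(uint64_t begin_time_, uint64_t created_time_)
        : begin_time(begin_time_)
        , created_time(created_time_)
    {}

    // Assumed meaning: when the ingest operation began.
    uint64_t beginTime() const { return begin_time; }

    // Assumed meaning: when this info object was created.
    // Returns created_time, not begin_time.
    uint64_t createdTime() const { return created_time; }

private:
    uint64_t begin_time;
    uint64_t created_time;
};
```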

Member Author

Changed this PR back to DNM, because some new functionality has been added to this PR.

@CalvinNeo CalvinNeo changed the title Add metrics about the last time a region is sent a snapshot DNM Add metrics about the last time a region is sent a snapshot Sep 20, 2024
@CalvinNeo
Member Author

After #9434

Signed-off-by: Calvin Neo <[email protected]>
Contributor

ti-chi-bot bot commented Oct 15, 2024

@CalvinNeo: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
pull-unit-test 0b438bc link true /test pull-unit-test
pull-integration-test 0b438bc link true /test pull-integration-test

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@ti-chi-bot ti-chi-bot bot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Oct 18, 2024
Contributor

ti-chi-bot bot commented Oct 18, 2024

PR needs rebase.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

Labels
needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. release-note-none Denotes a PR that doesn't merit a release note. size/XL Denotes a PR that changes 500-999 lines, ignoring generated files.
3 participants