-
Notifications
You must be signed in to change notification settings - Fork 411
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
DNM Add metrics about the last time a region is sent a snapshot #9429
base: master
Are you sure you want to change the base?
Conversation
Signed-off-by: Calvin Neo <[email protected]>
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
Signed-off-by: Calvin Neo <[email protected]>
@@ -351,7 +351,7 @@ void KVStore::onSnapshot( | |||
|
|||
tmt.getRegionTable().shrinkRegionRange(*new_region); | |||
} | |||
|
|||
new_region->updateSnapshotAppliedTime(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We always create a new_region
instance to apply snapshot instead of modifying the old_region
in-place. So I think the tiflash_raft_long_term_event_duration_seconds, type_apply_snapshot_gap
is not get reported at all?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I have tried to fix this by fetch the lastSNapshotAppliedTime()
from the old_region(if any) to compute. However, it is still hard to observe the metric in real tests. Because the only stable way to make TiFlash lag is to kill it. However, if TiFlash restarts, the lastSNapshotAppliedTime()
will be set to 0.
Signed-off-by: Calvin Neo <[email protected]>
Co-authored-by: jinhelin <[email protected]>
Signed-off-by: Calvin Neo <[email protected]>
Signed-off-by: Calvin Neo <[email protected]>
Signed-off-by: Calvin Neo <[email protected]>
Signed-off-by: Calvin Neo <[email protected]>
@@ -69,6 +69,8 @@ class Stopwatch | |||
is_running = true; | |||
} | |||
|
|||
UInt64 getStartMillis() { return start_ns / 1000000UL; } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Abbreviating milliseconds
to millis
seem a bit strange.
getStartMS
may be more clear.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would just use Milliseconds because other methods use this too...
@@ -42,6 +42,7 @@ struct CheckpointIngestInfo | |||
UInt64 regionId() const { return region_id; } | |||
UInt64 peerId() const { return peer_id; } | |||
UInt64 beginTime() const { return begin_time; } | |||
UInt64 createdTime() const { return begin_time; } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should be return created_time
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
And add comments about what is the difference between beginTime
and createdTime
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Changed this PR back to DNM, because some new functionalities is added into this PR.
After #9434 |
Signed-off-by: Calvin Neo <[email protected]>
@CalvinNeo: The following tests failed, say
Full PR test history. Your PR dashboard. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here. |
PR needs rebase. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |
What problem does this PR solve?
Issue Number: ref #9241
Problem Summary:
The idea is when a region got stuck, TiKV leader could send it a snapshot. We will record how many time has passed since the last snapshot. If the time is short, then the region may have some problems here.
What is changed and how it works?
Check List
Tests
Side effects
Documentation
Release note