pageserver: add page_trace API for debugging #10293
base: main
LGTM.
.await?;

let (page_trace, mut trace_rx) = PageTrace::new(event_limit);
timeline.page_trace.store(Arc::new(Some(page_trace)));
Should this error if there's already a trace in progress?
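One way to read this suggestion is sketched below, using only the standard library. `TraceSlot`, `TraceAlreadyRunning`, and the `key` field on `PageTraceEvent` are hypothetical names for illustration, and a `Mutex<Option<...>>` stands in for the atomically swapped `timeline.page_trace` slot in the PR.

```rust
use std::sync::mpsc::{channel, Receiver, Sender};
use std::sync::Mutex;

/// Hypothetical stand-in for the event type sent over the trace channel.
#[derive(Debug, Clone, PartialEq)]
pub struct PageTraceEvent {
    pub key: u64,
}

/// Hypothetical stand-in for the per-timeline slot holding the active trace.
#[derive(Default)]
pub struct TraceSlot {
    active: Mutex<Option<Sender<PageTraceEvent>>>,
}

/// Error returned when a trace is requested while another is still running.
#[derive(Debug, PartialEq)]
pub struct TraceAlreadyRunning;

impl TraceSlot {
    /// Installs a new trace, or fails if one is already installed.
    pub fn start(&self) -> Result<Receiver<PageTraceEvent>, TraceAlreadyRunning> {
        let mut slot = self.active.lock().unwrap();
        if slot.is_some() {
            return Err(TraceAlreadyRunning);
        }
        let (tx, rx) = channel();
        *slot = Some(tx);
        Ok(rx)
    }

    /// Switches the trace off by dropping the stored sender.
    pub fn stop(&self) {
        *self.active.lock().unwrap() = None;
    }
}
```

With this shape, a second `start()` fails until `stop()` has run, so a concurrent request cannot silently replace the first request's sender.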
// Above code is infallible, so we guarantee to switch the trace off when done
timeline.page_trace.store(Arc::new(None));
nit: we could also stream to the client, and cancel if the client goes away.
pub(crate) fn new(
    size_limit: u64,
) -> (Self, tokio::sync::mpsc::UnboundedReceiver<PageTraceEvent>) {
    let (trace_tx, trace_rx) = tokio::sync::mpsc::unbounded_channel();
nit: we could also use a buffered channel with the max size here, to avoid the size accounting.
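A rough sketch of that idea, assuming the limit is expressed as a number of events rather than bytes. It uses `std::sync::mpsc::sync_channel` as a synchronous stand-in for a bounded tokio channel (tokio's bounded `Sender` also offers `try_send`). The `record` method and the `key` field are hypothetical.

```rust
use std::sync::mpsc::{sync_channel, Receiver, SyncSender, TrySendError};

/// Hypothetical stand-in for the real event type.
#[derive(Debug, Clone, PartialEq)]
pub struct PageTraceEvent {
    pub key: u64,
}

pub struct PageTrace {
    trace_tx: SyncSender<PageTraceEvent>,
}

impl PageTrace {
    /// `event_limit` is the channel capacity; no separate size counter is kept.
    pub fn new(event_limit: usize) -> (Self, Receiver<PageTraceEvent>) {
        let (trace_tx, trace_rx) = sync_channel(event_limit);
        (Self { trace_tx }, trace_rx)
    }

    /// Records an event without blocking the caller. Returns false if the
    /// event was dropped because the buffer is full or the receiver is gone.
    pub fn record(&self, event: PageTraceEvent) -> bool {
        match self.trace_tx.try_send(event) {
            Ok(()) => true,
            Err(TrySendError::Full(_)) | Err(TrySendError::Disconnected(_)) => false,
        }
    }
}
```

The channel capacity enforces the limit, so the sending side needs no counter of its own. The trade-off is that events arriving while the buffer is full are dropped unless the receiver drains it concurrently.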
7260 tests run: 6894 passed, 0 failed, 366 skipped (full report)
Flaky tests (2): Postgres 17, Postgres 15
Code coverage* (full report)
* collected from Rust tests only
The comment gets automatically updated with the latest test results
2b8b0f7 at 2025-01-09T12:01:20.878Z
Neat!
I think this is safe to deploy, barring the check_permission problem.
Nits can be addressed in a follow-up.
let size_limit =
    parse_query_param::<_, u64>(&request, "size_limit_bytes")?.unwrap_or(1024 * 1024);
let time_limit_secs = parse_query_param::<_, u64>(&request, "time_limit_secs")?.unwrap_or(5);
nit: Why not parse a humantime::Duration?
pageserver/src/http/routes.rs (outdated)
loop {
    let timeout = deadline.saturating_duration_since(Instant::now());
    tokio::select! {
        event = trace_rx.recv() => {
            buffer.extend(bincode::serialize(&event).unwrap());

            if buffer.len() >= size_limit as usize {
                // Size threshold reached
                break;
            }
        }
        _ = tokio::time::sleep(timeout) => {
            // Time threshold reached
            break;
        }
    }
}
nit: instead of doing a repeated select!(), I think it's better style to declare one async block that does the loop { trace_rx.recv().await; }, then poll that block inside a timeout.
Roughly like so:
tokio::time::timeout(time_limit_secs, async {
    loop {
        let event = trace_rx.recv().await;
        ...
    }
}).await;
pageserver/src/http/routes.rs (outdated)
event = trace_rx.recv() => {
    buffer.extend(bincode::serialize(&event).unwrap());
I first thought event is always Ok(), but it isn't if this handler is called concurrently on the same timeline.
We should
- only write the Ok() value to the buffer and
- bail out of the loop as soon as recv() fails
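The two points above can be sketched as follows, using only the standard library. Note that tokio's `UnboundedReceiver::recv` returns an `Option`, yielding `None` once every sender has been dropped; the sketch uses `std::sync::mpsc::Receiver::recv_timeout` as a synchronous stand-in, where the same condition shows up as `Disconnected`. `collect_events` is a hypothetical helper, and it limits by event count rather than serialized bytes to avoid depending on bincode.

```rust
use std::sync::mpsc::{Receiver, RecvTimeoutError};
use std::time::{Duration, Instant};

/// Drains `trace_rx` until the time limit passes, the event limit is reached,
/// or every sender has been dropped. Only successfully received events are kept.
pub fn collect_events<T>(
    trace_rx: &Receiver<T>,
    time_limit: Duration,
    event_limit: usize,
) -> Vec<T> {
    let deadline = Instant::now() + time_limit;
    let mut events = Vec::new();
    while events.len() < event_limit {
        let remaining = deadline.saturating_duration_since(Instant::now());
        match trace_rx.recv_timeout(remaining) {
            // Only a successfully received event is stored.
            Ok(event) => events.push(event),
            // Time threshold reached.
            Err(RecvTimeoutError::Timeout) => break,
            // All senders are gone: stop instead of spinning on a closed channel.
            Err(RecvTimeoutError::Disconnected) => break,
        }
    }
    events
}
```

Breaking on the disconnected case matters because a closed channel reports the failure immediately on every call, so a loop that ignores it would spin until the deadline.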
This is going to busyloop if the timeline is dropped, but seems fine to deploy temporarily for now.
Problem
When a pageserver is receiving high rates of requests, we don't have a good way to efficiently discover what the client's access pattern is.
Closes: #10275
Summary of changes
Add a /v1/tenant/x/timeline/y/page_trace?size_limit_bytes=...&time_limit_secs=... API, which returns a binary buffer. A tool to decode and report on the output will follow separately.
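For illustration, a small helper that builds the request path for this endpoint might look like the sketch below. `page_trace_path` is a hypothetical name. Both query parameters are treated as optional, on the assumption that the handler falls back to the defaults shown in the review snippets (1024 * 1024 bytes and 5 seconds), which may differ in the merged version.

```rust
/// Builds the path and query string for the page trace endpoint.
/// Parameters left as `None` are omitted so the server-side defaults apply.
pub fn page_trace_path(
    tenant_id: &str,
    timeline_id: &str,
    size_limit_bytes: Option<u64>,
    time_limit_secs: Option<u64>,
) -> String {
    let mut path = format!(
        "/v1/tenant/{}/timeline/{}/page_trace",
        tenant_id, timeline_id
    );
    let mut params = Vec::new();
    if let Some(v) = size_limit_bytes {
        params.push(format!("size_limit_bytes={}", v));
    }
    if let Some(v) = time_limit_secs {
        params.push(format!("time_limit_secs={}", v));
    }
    if !params.is_empty() {
        path.push('?');
        path.push_str(&params.join("&"));
    }
    path
}
```

The response body is a binary buffer, so a caller would write it to a file rather than print it.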