row: fix intra-query memory leaks in kvFetcher and txnKVFetcher #65881
Conversation
Reviewed 1 of 1 files at r1, 1 of 1 files at r2.
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @jordanlewis)
pkg/sql/row/kv_fetcher.go, line 123 at r1 (raw file):
}
if len(f.batchResponse) == 0 {
	f.batchResponse = nil
super nit: might be worth leaving a quick comment about this.
The KVFetcher is the piece of code that does the first level of decoding of a KV batch response, doling out slices of keys and values to higher-level code that further decodes the key values into formats that the SQL engine can operate on.

The KVFetcher uses a slice into the batch response to keep track of where it is during the decoding process. Once the slice is empty, it's finished until someone asks it for a new batch. However, the KVFetcher used to keep around that empty slice pointer for its lifetime, or until it was asked for a new batch. This causes the batch response to be un-garbage-collectable, since there is still a slice pointing at it, even though the slice is empty.

This causes queries to use up to 2x their accounted-for batch memory, since the memory accounting system assumes that once data is transferred out of the batch response into the SQL representation, the batch response is freed - it assumes there's just 1 "copy" of this batch response memory. This is especially problematic for long queries (since they will not allow that KVFetcher memory to be freed until they're finished).

In effect, this causes 1 extra batch per KVFetcher per query to be retained in memory. This doesn't sound too bad, since a batch is of fixed size. But the max batch size is 1 megabyte, so with 1024 concurrent queries, each with 3 KVFetchers, like we see in a TPCH workload with 1024 concurrent query 18s, that's 1024 * 1MB * 3 = 3GB of unaccounted-for memory. This is easily enough memory to push a node over and cause it to OOM.

This patch nils the batch response pointer once the KVFetcher is finished decoding it, which allows it to be garbage collected as soon as possible. In practice, this seems to allow at least a single-node concurrency-1024 query18 TPCH workload to survive indefinitely (all queries return out-of-budget errors) without OOMing.
Release note (bug fix): queries use up to 1MB less actual system memory per scan, lookup join, index join, zigzag join, or inverted join in their query plans. This will result in improved memory performance for workloads with concurrent OLAP-style queries.
Force-pushed from 1f572c0 to 876dc0c (compare)
Reviewed 2 of 2 files at r3.
Reviewable status: complete! 1 of 0 LGTMs obtained (and 1 stale) (waiting on @jordanlewis and @mgartner)
pkg/sql/row/kv_batch_fetcher.go, line 537 at r4 (raw file):
f.responses = f.responses[1:]
origSpan := f.requestSpans[0]
f.requestSpans[0] = roachpb.Span{}
nit: pull out this popping pattern into helper functions like f.popFromRemainingBatches, f.popFromResponses, etc.
Reviewable status: complete! 1 of 0 LGTMs obtained (and 1 stale) (waiting on @jordanlewis and @mgartner)
pkg/sql/row/kv_batch_fetcher.go, line 537 at r4 (raw file):
Previously, mgartner (Marcus Gartner) wrote…
nit: pull out this popping pattern into helper functions like f.popFromRemainingBatches, f.popFromResponses, etc.
I don't see a way to reduce any code by doing this, since each use is subtly different. I guess I could see an argument for making helper functions that are used just once, but I'm not sure I buy it. But, any suggestions for making this subtle GC poking less subtle are appreciated.
pkg/sql/row/kv_batch_fetcher.go, line 545 at r4 (raw file):
is this really necessary since we just assign the whole slice to nil on the next line? that's surprising...
Reviewable status: complete! 1 of 0 LGTMs obtained (and 1 stale) (waiting on @jordanlewis, @mgartner, and @rytaft)
pkg/sql/row/kv_batch_fetcher.go, line 545 at r4 (raw file):
Previously, rytaft (Rebecca Taft) wrote…
is this really necessary since we just assign the whole slice to nil on the next line? that's surprising...
It's necessary because the underlying t.BatchResponses slice is retained by f.remainingBatches. Even though we assign f.remainingBatches to t.BatchResponses[1:], the 0th element will still be retained from the perspective of the garbage collector unless we nil it out.
Perhaps a more obvious way to phrase this would be to say:

f.remainingBatches = t.BatchResponses
f.remainingBatches[0] = nil
f.remainingBatches = f.remainingBatches[1:]
Thoughts?
pkg/sql/row/kv_batch_fetcher.go, line 545 at r4 (raw file):
Previously, jordanlewis (Jordan Lewis) wrote…
Oh didn't catch that. Yea that change might be clearer. Thanks!
pkg/sql/row/kv_batch_fetcher.go, line 537 at r4 (raw file):
Previously, jordanlewis (Jordan Lewis) wrote…
Ya there would be no code reduction. I was thinking it would make this function less dense - you have to grok a few action-packed lines to understand that these are simply pop operations. Separating the mechanics of the pop might aid in legibility. Feel free to leave as-is!
Reviewable status: complete! 1 of 0 LGTMs obtained (and 1 stale) (waiting on @jordanlewis and @mgartner)
pkg/sql/row/kv_batch_fetcher.go, line 537 at r4 (raw file):
Previously, mgartner (Marcus Gartner) wrote…
Ya there would be no code reduction. I was thinking it would make this function less dense - you have to grok a few action-packed lines to understand that these are simply pop operations. Separating the mechanics of the pop might aid in legibility. Feel free to leave as-is!
A popRemainingBatches function would be used at least three times. Below we would just do f.remainingBatches, t.BatchResponses = t.BatchResponses, nil before calling it.
Force-pushed from 876dc0c to 3bb865d (compare)
LGTM
Previously, we could leave some dangling references to batch responses around in the txnKVFetcher when we were fetching more than one batch at a time. This would cause a delay in reclamation of memory for the lifetime of a given query.

Release note (bug fix): use less memory in some queries, primarily lookup joins.
Force-pushed from 3bb865d to d6d394b (compare)
TFTRs, I was able to extract a little helper. bors r+
Build failed (retrying...)
@jordanlewis I added backport labels, because I'm assuming this should be backported to as many versions as possible.
Build succeeded
Updates #64906. The critical change is the first patch, the kvFetcher one. The kvBatchFetcher one is theoretically good, but I haven't found it to make as large a difference as the first patch.
With the first patch applied alone, I can no longer cause OOM conditions with
1024 concurrent TPCH query 18s sent at a single machine, which is a major
improvement. Prior to that patch, such a workload would overwhelm the machine
within 1-2 minutes.
This bug was found with the help of the new tooling I've been adding to viewcore: mostly the new pprof output format, the existing HTML object explorer, and the new type explorer. You can see these updates at https://github.com/jordanlewis/debug/tree/crl-stuff.