row: fix intra-query memory leaks in kvFetcher and txnKVFetcher #65881
Conversation
Reviewed 1 of 1 files at r1, 1 of 1 files at r2.
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @jordanlewis)
pkg/sql/row/kv_fetcher.go, line 123 at r1 (raw file):
}
if len(f.batchResponse) == 0 {
	f.batchResponse = nil
super nit: might be worth leaving a quick comment about this.
The KVFetcher is the piece of code that does the first level of decoding of a KV batch response, doling out slices of keys and values to higher-level code that further decodes the key values into formats that the SQL engine can operate on.

The KVFetcher uses a slice into the batch response to keep track of where it is during the decoding process. Once the slice is empty, it's finished until someone asks it for a new batch. However, the KVFetcher used to keep around that empty slice pointer for its lifetime, or until it was asked for a new batch. This causes the batch response to be un-garbage-collectable, since there is still a slice pointing at it, even though the slice is empty.

This causes queries to use up to 2x their accounted-for batch memory, since the memory accounting system assumes that once data is transferred out of the batch response into the SQL representation, the batch response is freed - it assumes there's just 1 "copy" of this batch response memory. This is especially problematic for long queries (since they will not allow that KVFetcher memory to be freed until they're finished).

In effect, this causes 1 extra batch per KVFetcher per query to be retained in memory. This doesn't sound too bad, since a batch is of fixed size. But the max batch size is 1 megabyte, so with 1024 concurrent queries, each with 3 KVFetchers, like we see in a TPCH workload with 1024 concurrent query 18s, that's 1024 * 1MB * 3 = 3GB of unaccounted-for memory. This is easily enough memory to push a node over and cause it to OOM.

This patch nils the batch response pointer once the KVFetcher is finished decoding it, which allows it to be garbage collected as soon as possible. In practice, this seems to allow at least a single-node concurrency-1024 query18 TPCH workload to survive indefinitely (all queries return out-of-budget errors) without OOMing.
Release note (bug fix): queries use up to 1MB less actual system memory per scan, lookup join, index join, zigzag join, or inverted join in their query plans. This will result in improved memory performance for workloads with concurrent OLAP-style queries.
Force-pushed from 1f572c0 to 876dc0c (compare)
Reviewed 2 of 2 files at r3.
Reviewable status: complete! 1 of 0 LGTMs obtained (and 1 stale) (waiting on @jordanlewis and @mgartner)
pkg/sql/row/kv_batch_fetcher.go, line 537 at r4 (raw file):
f.responses = f.responses[1:]
origSpan := f.requestSpans[0]
f.requestSpans[0] = roachpb.Span{}
nit: pull out this popping pattern into helper functions like f.popFromRemainingBatches, f.popFromResponses, etc.
Reviewable status: complete! 1 of 0 LGTMs obtained (and 1 stale) (waiting on @jordanlewis and @mgartner)
pkg/sql/row/kv_batch_fetcher.go, line 537 at r4 (raw file):
Previously, mgartner (Marcus Gartner) wrote…
nit: pull out this popping pattern into helper functions like f.popFromRemainingBatches, f.popFromResponses, etc.
I don't see a way to reduce any code by doing this, since each use is subtly different. I guess I could see an argument for making helper functions that are used just once, but I'm not sure I buy it. But, any suggestions for making this subtle GC poking less subtle are appreciated.
pkg/sql/row/kv_batch_fetcher.go, line 545 at r4 (raw file):
is this really necessary since we just assign the whole slice to nil on the next line? that's surprising...
Reviewable status: complete! 1 of 0 LGTMs obtained (and 1 stale) (waiting on @jordanlewis, @mgartner, and @rytaft)
pkg/sql/row/kv_batch_fetcher.go, line 545 at r4 (raw file):
Previously, rytaft (Rebecca Taft) wrote…
is this really necessary since we just assign the whole slice to nil on the next line? that's surprising...
It's necessary because the underlying t.BatchResponses slice is retained by f.remainingBatches. Even though we assign f.remainingBatches to t.BatchResponses[1:], the 0th element will still be retained from the perspective of the garbage collector unless we nil it out.
Perhaps a more obvious way to phrase this would be to say:

f.remainingBatches = t.BatchResponses
f.remainingBatches[0] = nil
f.remainingBatches = f.remainingBatches[1:]
Thoughts?
pkg/sql/row/kv_batch_fetcher.go, line 545 at r4 (raw file):
Previously, jordanlewis (Jordan Lewis) wrote…
Oh didn't catch that. Yea that change might be clearer. Thanks!
pkg/sql/row/kv_batch_fetcher.go, line 537 at r4 (raw file):
Previously, jordanlewis (Jordan Lewis) wrote…
Ya there would be no code reduction. I was thinking it would make this function less dense - you have to grok a few action-packed lines to understand that these are simply pop operations. Separating the mechanics of the pop might aid in legibility. Feel free to leave as-is!
Reviewable status: complete! 1 of 0 LGTMs obtained (and 1 stale) (waiting on @jordanlewis and @mgartner)
pkg/sql/row/kv_batch_fetcher.go, line 537 at r4 (raw file):
Previously, mgartner (Marcus Gartner) wrote…
Ya there would be no code reduction. I was thinking it would make this function less dense - you have to grok a few action-packed lines to understand that these are simply pop operations. Separating the mechanics of the pop might aid in legibility. Feel free to leave as-is!
A popRemainingBatches function would be used at least three times. Below we would just do f.remainingBatches, t.BatchResponses = t.BatchResponses, nil before calling it.
Force-pushed from 876dc0c to 3bb865d (compare)
LGTM
Previously, we could leave some dangling references to batch responses around in the txnKVFetcher when we were fetching more than one batch at a time. This would cause a delay in reclamation of memory for the lifetime of a given query.

Release note (bug fix): use less memory in some queries, primarily lookup joins.
Force-pushed from 3bb865d to d6d394b (compare)
TFTRs, I was able to extract a little helper. bors r+
Build failed (retrying...)
@jordanlewis I added backport labels, because I'm assuming this should be backported to as many versions as possible.
Build succeeded
Updates #64906. The critical change is the first patch, the kvFetcher one. The kvBatchFetcher one is theoretically good, but I haven't found it to make as large a difference as the first patch.
With the first patch applied alone, I can no longer cause OOM conditions with
1024 concurrent TPCH query 18s sent at a single machine, which is a major
improvement. Prior to that patch, such a workload would overwhelm the machine
within 1-2 minutes.
This bug was found with the help of the new tooling I've been adding to viewcore: mostly the new pprof output format, the existing HTML object explorer, and the new type explorer. You can see these updates at https://github.com/jordanlewis/debug/tree/crl-stuff.