Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

storage: kv-level memory accounting/bounding #19721

Open
nvanbenschoten opened this issue Nov 1, 2017 · 5 comments
Open

storage: kv-level memory accounting/bounding #19721

nvanbenschoten opened this issue Nov 1, 2017 · 5 comments
Labels
A-kv-client Relating to the KV client and the KV interface. A-kv-observability A-kv-server Relating to the KV-level RPC server C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) T-kv KV Team

Comments

@nvanbenschoten
Copy link
Member

nvanbenschoten commented Nov 1, 2017

#8691 introduces a mechanism to track and limit memory use at the SQL layer, but no effort has been put into tracking and limiting memory usage at any layers beneath this. Without this protection, scans over large tables can still result in trivial out-of-memory crashes, as we've recently seen. I think we'll want to add some memory accounting down in the kv-layer so that we can track memory across all memory-intensive operations in a node (MVCC scans, BatchResponse serialization/deserialization, etc.).

I also think we should add some kind of memory limit to BatchRequests so that responses will not exceed available memory for a client node. Even if one node has enough memory to field a KV request, another may not have enough to receive its response. This could also be accomplished by adding some kind of interceptor for BatchResponses that stops reading the response when it gets too large.

There have been some attempts to limit the size of single kv scans to reasonable numbers of rows. However, these attempts use a constant maximum row count. If rows are unexpectedly large than the current max (10,000 rows) may still be big enough to cause an OOM. We'll probably want to dynamically adjust this maximum limit based on the size of each row. Unfortunately, this problem is exacerbated by the fact that SQL scans can't push column filters down through ScanRequests. This means that large keys may be returned in a ScanResponse even if they are not necessary for the query, leading to unexpected OOMs. This is, for the most part, an orthogonal issue, but demonstrates why we might want memory accounting beneath SQL.

cc. @andreimatei @tschottdorf @knz

Jira issue: CRDB-5958

@petermattis petermattis added C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) A-kv-client Relating to the KV client and the KV interface. labels Jul 21, 2018
@tbg tbg modified the milestones: 2.1, 2.2 Jul 22, 2018
@petermattis petermattis removed this from the 2.2 milestone Oct 5, 2018
@jordanlewis
Copy link
Member

Bumping this - it's (more or less) the cause of the symptoms in #33660. Imagine a full table scan over a table with 512kb rows. The SQL fetchers ask for up to 10k rows at a time - this constant is pretty "magic", but it's been with us for quite some time and reducing it to hedge against large row sizes seems like a bad idea.

We'd love to do better, but it's quite difficult without having a bytes limit on requests instead of a keys limit.

Is there already an issue for this? Can we repurpose this issue to be the feature request:

Add a MaxSpanRequestBytes field to roachpb.Header so that clients can control the maximum number of bytes they'd like to process in memory at once, rather than the maximum number of returned KVs?

tbg added a commit to tbg/cockroach that referenced this issue Jan 24, 2020
TODO: due to the RocksDB/pebble modes of operation, there are
currently two ways to do anything. Thanks to the two possible
response types for batches (KVs vs bytes) there is another
factor of two. In short, I expect to have to make three more
similar changes to fully be able to implement byte hints for
scans.
TODO: testing and potentially write the code in a saner way.

----

A fledgling step towards cockroachdb#19721 is allowing incoming KV requests to
bound the size of the response in terms of bytes rather than rows.
This commit provides the necessary functionality to pebbleMVCCScanner:
The new maxBytes field stops the scan once the size of the result
meets or exceeds the threshold (at least one key will be added,
regardless of its size).

The classic example of the problem this addresses is a table in which
each row is, say, ~1mb in size. A full table scan will currently fetch
data from KV in batches of [10k], causing at least 10GB of data held in
memory at any given moment. This sort of thing does happen in practice;
we have a long-failing roachtest cockroachdb#33660 because of just that, and
anecdotally OOMs in production clusters are with regularity caused by
individual queries consuming excessive amounts of memory at the KV
level.

Plumbing this limit into a header field on BatchRequest and down to the
engine level will allow the batching in [10k] to become byte-sized in
nature, thus avoiding one obvious source OOMs. This doesn't solve cockroachdb#19721
in general (many such requests could work together to consume a lot of
memory after all), but it's a sane baby step that might just avoid a
large portion of OOMs already.

[10k]: https://github.com/cockroachdb/cockroach/blob/0a658c19cd164e7c021eaff7f73db173f0650e8c/pkg/sql/row/kv_batch_fetcher.go#L25-L29

Release note: None
tbg added a commit to tbg/cockroach that referenced this issue Jan 30, 2020
A fledgling step towards cockroachdb#19721 is allowing incoming KV requests to
bound the size of the response in terms of bytes rather than rows.
This commit adds a TargetBytes field to MVCCScanOptions to address
this need: scans stop once the size of the result meets or exceeds the
threshold (at least one key will be added, regardless of its size),
and returns a ResumeSpan as appropriate.

The classic example of the problem this addresses is a table in which
each row is, say, ~1mb in size. A full table scan will currently fetch
data from KV in batches of [10k], causing at least 10GB of data held in
memory at any given moment. This sort of thing does happen in practice;
we have a long-failing roachtest cockroachdb#33660 because of just that, and
anecdotally OOMs in production clusters are with regularity caused by
individual queries consuming excessive amounts of memory at the KV
level.

Plumbing this limit into a header field on BatchRequest and down to the
engine level will allow the batching in [10k] to become byte-sized in
nature, thus avoiding one obvious source OOMs. This doesn't solve cockroachdb#19721
in general (many such requests could work together to consume a lot of
memory after all), but it's a sane baby step that might just avoid a
large portion of OOMs already.

[10k]: https://github.com/cockroachdb/cockroach/blob/0a658c19cd164e7c021eaff7f73db173f0650e8c/pkg/sql/row/kv_batch_fetcher.go#L25-L29

Release note: None
tbg added a commit to tbg/cockroach that referenced this issue Feb 3, 2020
A fledgling step towards cockroachdb#19721 is allowing incoming KV requests to
bound the size of the response in terms of bytes rather than rows.
This commit adds a TargetBytes field to MVCCScanOptions to address
this need: scans stop once the size of the result meets or exceeds the
threshold (at least one key will be added, regardless of its size),
and returns a ResumeSpan as appropriate.

The classic example of the problem this addresses is a table in which
each row is, say, ~1mb in size. A full table scan will currently fetch
data from KV in batches of [10k], causing at least 10GB of data held in
memory at any given moment. This sort of thing does happen in practice;
we have a long-failing roachtest cockroachdb#33660 because of just that, and
anecdotally OOMs in production clusters are with regularity caused by
individual queries consuming excessive amounts of memory at the KV
level.

Plumbing this limit into a header field on BatchRequest and down to the
engine level will allow the batching in [10k] to become byte-sized in
nature, thus avoiding one obvious source OOMs. This doesn't solve cockroachdb#19721
in general (many such requests could work together to consume a lot of
memory after all), but it's a sane baby step that might just avoid a
large portion of OOMs already.

[10k]: https://github.com/cockroachdb/cockroach/blob/0a658c19cd164e7c021eaff7f73db173f0650e8c/pkg/sql/row/kv_batch_fetcher.go#L25-L29

Release note: None
craig bot pushed a commit that referenced this issue Feb 3, 2020
44339: engine: add byte limit to MVCCScan r=petermattis,itsbilal a=tbg

A fledgling step towards #19721 is allowing incoming KV requests to
bound the size of the response in terms of bytes rather than rows.
This commit adds a TargetBytes field to MVCCScanOptions to address
this need: scans stop once the size of the result meets or exceeds the
threshold (at least one key will be added, regardless of its size),
and returns a ResumeSpan as appropriate.

The classic example of the problem this addresses is a table in which
each row is, say, ~1mb in size. A full table scan will currently fetch
data from KV in batches of [10k], causing at least 10GB of data held in
memory at any given moment. This sort of thing does happen in practice;
we have a long-failing roachtest #33660 because of just that, and
anecdotally OOMs in production clusters are with regularity caused by
individual queries consuming excessive amounts of memory at the KV
level.

Plumbing this limit into a header field on BatchRequest and down to the
engine level will allow the batching in [10k] to become byte-sized in
nature, thus avoiding one obvious source OOMs. This doesn't solve #19721
in general (many such requests could work together to consume a lot of
memory after all), but it's a sane baby step that might just avoid a
large portion of OOMs already.

[10k]: https://github.com/cockroachdb/cockroach/blob/0a658c19cd164e7c021eaff7f73db173f0650e8c/pkg/sql/row/kv_batch_fetcher.go#L25-L29

Release note: None

Co-authored-by: Tobias Schottdorf <[email protected]>
@lunevalex
Copy link
Collaborator

Related discussion in #51551

@tbg
Copy link
Member

tbg commented Aug 21, 2020

Summarizing the discussion from #51551: now that we're using pebble, we should be able to accurately allocate from a memory budget in mvccScanner (we can budget in other places too - but this is the one that matters). It looks like there was consensus that that budget should be that given by the --sql-memory flag (i.e. we should not attempt to force users to try to partition memory between the SQL frontend and KV backend). There was some discussion on whether to queue for budget or to fail. I would say on the first stab we will definitely fail as what we're trying to achieve is the simplest means possible to avoid OOMs. Adding any kind of queuing is complex and comes with new debugging challenges.

It's worth noting that even with all of this in place, we still need to retain some form of the speculative accounting on the SQL side (table fetchers) for the common case in which the request does not hit the local fast-path. If it does hit the local fast path, we could conceivably "transfer" the allocated budget from KV back up to SQL, but there may be practical challenges to doing so. Naively, KV would release the allocation "after" the response has been sent to the client (though that "after" is likely difficult to implement, depending on what hooks gRPC offers here; it may need to happen in a custom marshaler or something like that, may not be worth it).

@tbg tbg added A-kv-observability A-kv-server Relating to the KV-level RPC server labels Sep 3, 2020
sumeerbhola added a commit to sumeerbhola/cockroach that referenced this issue Jun 22, 2021
The goal is to track memory allocations for:
- non-local SQL=>KV requests: this can happen with joins,
  multi-tenant clusters, and if ranges move between DistSQL
  planning and execution.
- local SQL=>KV requests for the first request by a fetcher:
  in this case the fetcher reserves a modest 1KB which can
  be significantly exceeded by KV allocations.

Only allocations in pebbleMVCCScanner for kv pairs and intents are
tracked. The memory is released when returning from
executeReadOnlyBatchWithServersideRefreshes since the chain
of returns will end up in gRPC response processing and we can't
hook into where that memory is released. This should still help
for some cases of OOMs, and give some signal of memory overload
that we can use elsewhere (e.g. admission control).

The BytesMonitor is used to construct a BoundAccount that is
wrapped in a narrower ScannerMemoryMonitor that is passed via the
EvalContext interface. The other alternative would be for the
engine to have a BytesMonitor at initialization time that it
can use to construct a BoundAccount for each MVCC scan, and
pass it back via MVCCScanResult. This would mean multiple
BoundAccounts for a batch (since we don't want to release memory
until all the requests in the batch are processed), and would
be harder to extend to track additional request types compared
to embedding in EvalContext.

The rootSQLMonitor is reused for this memory allocation tracking.
This tracking is always done for non-local requests, and for the
first request by a fetcher for a local request. This is to
avoid double-counting, the first request issued by a SQL fetcher
only reserves 1KB, but subsequent ones have already reserved
what was returned in the first response. So there is room to
tighten this if we knew what had been reserved on the local
client (there are complications because the batch may have
been split to send to different nodes, only one of which was
local).

The AdmissionHeader.SourceLocation field is used to mark local
requests and is set in rpc.internalClientAdapter. The first
request is marked using the
AdmissionHeader.NoMemoryReservedAtSource bit.

Informs cockroachdb#19721

Release note (ops change): The memory pool used for SQL is now
also used to cover KV memory used for scans.
sumeerbhola added a commit to sumeerbhola/cockroach that referenced this issue Jun 26, 2021
The goal is to track memory allocations for:
- non-local SQL=>KV requests: this can happen with joins,
  multi-tenant clusters, and if ranges move between DistSQL
  planning and execution.
- local SQL=>KV requests for the first request by a fetcher:
  in this case the fetcher reserves a modest 1KB which can
  be significantly exceeded by KV allocations.

Only allocations in pebbleMVCCScanner for kv pairs and intents are
tracked. The memory is released when returning from
executeReadOnlyBatchWithServersideRefreshes since the chain
of returns will end up in gRPC response processing and we can't
hook into where that memory is released. This should still help
for some cases of OOMs, and give some signal of memory overload
that we can use elsewhere (e.g. admission control).

The BytesMonitor is used to construct a BoundAccount that is
wrapped in a narrower ScannerMemoryMonitor that is passed via the
EvalContext interface. The other alternative would be for the
engine to have a BytesMonitor at initialization time that it
can use to construct a BoundAccount for each MVCC scan, and
pass it back via MVCCScanResult. This would mean multiple
BoundAccounts for a batch (since we don't want to release memory
until all the requests in the batch are processed), and would
be harder to extend to track additional request types compared
to embedding in EvalContext.

The rootSQLMonitor is reused for this memory allocation tracking.
This tracking is always done for non-local requests, and for the
first request by a fetcher for a local request. This is to
avoid double-counting, the first request issued by a SQL fetcher
only reserves 1KB, but subsequent ones have already reserved
what was returned in the first response. So there is room to
tighten this if we knew what had been reserved on the local
client (there are complications because the batch may have
been split to send to different nodes, only one of which was
local).

The AdmissionHeader.SourceLocation field is used to mark local
requests and is set in rpc.internalClientAdapter. The first
request is marked using the
AdmissionHeader.NoMemoryReservedAtSource bit.

Informs cockroachdb#19721

Release note (ops change): The memory pool used for SQL is now
also used to cover KV memory used for scans.
sumeerbhola added a commit to sumeerbhola/cockroach that referenced this issue Jun 28, 2021
The goal is to track memory allocations for:
- non-local SQL=>KV requests: this can happen with joins,
  multi-tenant clusters, and if ranges move between DistSQL
  planning and execution.
- local SQL=>KV requests for the first request by a fetcher:
  in this case the fetcher reserves a modest 1KB which can
  be significantly exceeded by KV allocations.

Only allocations in pebbleMVCCScanner for kv pairs and intents are
tracked. The memory is released when returning from
executeReadOnlyBatchWithServersideRefreshes since the chain
of returns will end up in gRPC response processing and we can't
hook into where that memory is released. This should still help
for some cases of OOMs, and give some signal of memory overload
that we can use elsewhere (e.g. admission control).

The BytesMonitor is used to construct a BoundAccount that is
wrapped in a narrower ScannerMemoryMonitor that is passed via the
EvalContext interface. The other alternative would be for the
engine to have a BytesMonitor at initialization time that it
can use to construct a BoundAccount for each MVCC scan, and
pass it back via MVCCScanResult. This would mean multiple
BoundAccounts for a batch (since we don't want to release memory
until all the requests in the batch are processed), and would
be harder to extend to track additional request types compared
to embedding in EvalContext.

The rootSQLMonitor is reused for this memory allocation tracking.
This tracking is always done for non-local requests, and for the
first request by a fetcher for a local request. This is to
avoid double-counting, the first request issued by a SQL fetcher
only reserves 1KB, but subsequent ones have already reserved
what was returned in the first response. So there is room to
tighten this if we knew what had been reserved on the local
client (there are complications because the batch may have
been split to send to different nodes, only one of which was
local).

The AdmissionHeader.SourceLocation field is used to mark local
requests and is set in rpc.internalClientAdapter. The first
request is marked using the
AdmissionHeader.NoMemoryReservedAtSource bit.

Informs cockroachdb#19721

Release note (ops change): The memory pool used for SQL is now
also used to cover KV memory used for scans.
sumeerbhola added a commit to sumeerbhola/cockroach that referenced this issue Jun 28, 2021
The goal is to track memory allocations for:
- non-local SQL=>KV requests: this can happen with joins,
  multi-tenant clusters, and if ranges move between DistSQL
  planning and execution.
- local SQL=>KV requests for the first request by a fetcher:
  in this case the fetcher reserves a modest 1KB which can
  be significantly exceeded by KV allocations.

Only allocations in pebbleMVCCScanner for kv pairs and intents are
tracked. The memory is released when returning from
executeReadOnlyBatchWithServersideRefreshes since the chain
of returns will end up in gRPC response processing and we can't
hook into where that memory is released. This should still help
for some cases of OOMs, and give some signal of memory overload
that we can use elsewhere (e.g. admission control).

The BytesMonitor is used to construct a BoundAccount that is
wrapped in a narrower ScannerMemoryMonitor that is passed via the
EvalContext interface. The other alternative would be for the
engine to have a BytesMonitor at initialization time that it
can use to construct a BoundAccount for each MVCC scan, and
pass it back via MVCCScanResult. This would mean multiple
BoundAccounts for a batch (since we don't want to release memory
until all the requests in the batch are processed), and would
be harder to extend to track additional request types compared
to embedding in EvalContext.

The rootSQLMonitor is reused for this memory allocation tracking.
This tracking is always done for non-local requests, and for the
first request by a fetcher for a local request. This is to
avoid double-counting, the first request issued by a SQL fetcher
only reserves 1KB, but subsequent ones have already reserved
what was returned in the first response. So there is room to
tighten this if we knew what had been reserved on the local
client (there are complications because the batch may have
been split to send to different nodes, only one of which was
local).

The AdmissionHeader.SourceLocation field is used to mark local
requests and is set in rpc.internalClientAdapter. The first
request is marked using the
AdmissionHeader.NoMemoryReservedAtSource bit.

Informs cockroachdb#19721

Release note (ops change): The memory pool used for SQL is now
also used to cover KV memory used for scans.
sumeerbhola added a commit to sumeerbhola/cockroach that referenced this issue Jun 29, 2021
The goal is to track memory allocations for:
- non-local SQL=>KV requests: this can happen with joins,
  multi-tenant clusters, and if ranges move between DistSQL
  planning and execution.
- local SQL=>KV requests for the first request by a fetcher:
  in this case the fetcher reserves a modest 1KB which can
  be significantly exceeded by KV allocations.

Only allocations in pebbleMVCCScanner for kv pairs and intents are
tracked. The memory is released when returning from
executeReadOnlyBatchWithServersideRefreshes since the chain
of returns will end up in gRPC response processing and we can't
hook into where that memory is released. This should still help
for some cases of OOMs, and give some signal of memory overload
that we can use elsewhere (e.g. admission control).

The BytesMonitor is used to construct a BoundAccount that is
wrapped in a narrower ScannerMemoryMonitor that is passed via the
EvalContext interface. The other alternative would be for the
engine to have a BytesMonitor at initialization time that it
can use to construct a BoundAccount for each MVCC scan, and
pass it back via MVCCScanResult. This would mean multiple
BoundAccounts for a batch (since we don't want to release memory
until all the requests in the batch are processed), and would
be harder to extend to track additional request types compared
to embedding in EvalContext.

The rootSQLMonitor is reused for this memory allocation tracking.
This tracking is always done for non-local requests, and for the
first request by a fetcher for a local request. This is to
avoid double-counting, the first request issued by a SQL fetcher
only reserves 1KB, but subsequent ones have already reserved
what was returned in the first response. So there is room to
tighten this if we knew what had been reserved on the local
client (there are complications because the batch may have
been split to send to different nodes, only one of which was
local).

The AdmissionHeader.SourceLocation field is used to mark local
requests and is set in rpc.internalClientAdapter. The first
request is marked using the
AdmissionHeader.NoMemoryReservedAtSource bit.

Informs cockroachdb#19721

Release note (ops change): The memory pool used for SQL is now
also used to cover KV memory used for scans.
sumeerbhola added a commit to sumeerbhola/cockroach that referenced this issue Jul 12, 2021
The goal is to track memory allocations for:
- non-local SQL=>KV requests: this can happen with joins,
  multi-tenant clusters, and if ranges move between DistSQL
  planning and execution.
- local SQL=>KV requests for the first request by a fetcher:
  in this case the fetcher reserves a modest 1KB which can
  be significantly exceeded by KV allocations.

Only allocations in pebbleMVCCScanner for kv pairs and intents are
tracked. The memory is released when returning from
executeReadOnlyBatchWithServersideRefreshes since the chain
of returns will end up in gRPC response processing and we can't
hook into where that memory is released. This should still help
for some cases of OOMs, and give some signal of memory overload
that we can use elsewhere (e.g. admission control).

The BytesMonitor is used to construct a BoundAccount that is
wrapped in a narrower ScannerMemoryMonitor that is passed via the
EvalContext interface. The other alternative would be for the
engine to have a BytesMonitor at initialization time that it
can use to construct a BoundAccount for each MVCC scan, and
pass it back via MVCCScanResult. This would mean multiple
BoundAccounts for a batch (since we don't want to release memory
until all the requests in the batch are processed), and would
be harder to extend to track additional request types compared
to embedding in EvalContext.

The rootSQLMonitor is reused for this memory allocation tracking.
This tracking is always done for non-local requests, and for the
first request by a fetcher for a local request. This is to
avoid double-counting, the first request issued by a SQL fetcher
only reserves 1KB, but subsequent ones have already reserved
what was returned in the first response. So there is room to
tighten this if we knew what had been reserved on the local
client (there are complications because the batch may have
been split to send to different nodes, only one of which was
local).

The AdmissionHeader.SourceLocation field is used to mark local
requests and is set in rpc.internalClientAdapter. The first
request is marked using the
AdmissionHeader.NoMemoryReservedAtSource bit.

Informs cockroachdb#19721

Release note (ops change): The memory pool used for SQL is now
also used to cover KV memory used for scans.
sumeerbhola added a commit to sumeerbhola/cockroach that referenced this issue Jul 14, 2021
The goal is to track memory allocations for:
- non-local SQL=>KV requests: this can happen with joins,
  multi-tenant clusters, and if ranges move between DistSQL
  planning and execution.
- local SQL=>KV requests for the first request by a fetcher:
  in this case the fetcher reserves a modest 1KB which can
  be significantly exceeded by KV allocations.

Only allocations in pebbleMVCCScanner for kv pairs and intents are
tracked. The memory is released when returning from
executeReadOnlyBatchWithServersideRefreshes since the chain
of returns will end up in gRPC response processing and we can't
hook into where that memory is released. This should still help
for some cases of OOMs, and give some signal of memory overload
that we can use elsewhere (e.g. admission control).

The BytesMonitor is used to construct a BoundAccount that is
wrapped in a narrower ScannerMemoryMonitor that is passed via the
EvalContext interface. The other alternative would be for the
engine to have a BytesMonitor at initialization time that it
can use to construct a BoundAccount for each MVCC scan, and
pass it back via MVCCScanResult. This would mean multiple
BoundAccounts for a batch (since we don't want to release memory
until all the requests in the batch are processed), and would
be harder to extend to track additional request types compared
to embedding in EvalContext.

The rootSQLMonitor is reused for this memory allocation tracking.
This tracking is always done for non-local requests, and for the
first request by a fetcher for a local request. This is to
avoid double-counting, the first request issued by a SQL fetcher
only reserves 1KB, but subsequent ones have already reserved
what was returned in the first response. So there is room to
tighten this if we knew what had been reserved on the local
client (there are complications because the batch may have
been split to send to different nodes, only one of which was
local).

The AdmissionHeader.SourceLocation field is used to mark local
requests and is set in rpc.internalClientAdapter. The first
request is marked using the
AdmissionHeader.NoMemoryReservedAtSource bit.

Informs cockroachdb#19721

Release note (ops change): The memory pool used for SQL is now
also used to cover KV memory used for scans.
sumeerbhola added a commit to sumeerbhola/cockroach that referenced this issue Jul 14, 2021
The goal is to track memory allocations for:
- non-local SQL=>KV requests: this can happen with joins,
  multi-tenant clusters, and if ranges move between DistSQL
  planning and execution.
- local SQL=>KV requests for the first request by a fetcher:
  in this case the fetcher reserves a modest 1KB which can
  be significantly exceeded by KV allocations.

Only allocations in pebbleMVCCScanner for kv pairs and intents are
tracked. The memory is released when returning from
executeReadOnlyBatchWithServersideRefreshes since the chain
of returns will end up in gRPC response processing and we can't
hook into where that memory is released. This should still help
for some cases of OOMs, and give some signal of memory overload
that we can use elsewhere (e.g. admission control).

The BytesMonitor is used to construct a BoundAccount that is
wrapped in a narrower ScannerMemoryMonitor that is passed via the
EvalContext interface. The other alternative would be for the
engine to have a BytesMonitor at initialization time that it
can use to construct a BoundAccount for each MVCC scan, and
pass it back via MVCCScanResult. This would mean multiple
BoundAccounts for a batch (since we don't want to release memory
until all the requests in the batch are processed), and would
be harder to extend to track additional request types compared
to embedding in EvalContext.

The rootSQLMonitor is reused for this memory allocation tracking.
This tracking is always done for non-local requests, and for the
first request by a fetcher for a local request. This is to
avoid double-counting, the first request issued by a SQL fetcher
only reserves 1KB, but subsequent ones have already reserved
what was returned in the first response. So there is room to
tighten this if we knew what had been reserved on the local
client (there are complications because the batch may have
been split to send to different nodes, only one of which was
local).

The AdmissionHeader.SourceLocation field is used to mark local
requests and is set in rpc.internalClientAdapter. The first
request is marked using the
AdmissionHeader.NoMemoryReservedAtSource bit.

Informs cockroachdb#19721

Release note (ops change): The memory pool used for SQL is now
also used to cover KV memory used for scans.
sumeerbhola added a commit to sumeerbhola/cockroach that referenced this issue Aug 4, 2021
The goal is to track memory allocations for:
- non-local SQL=>KV requests: this can happen with joins,
  multi-tenant clusters, and if ranges move between DistSQL
  planning and execution.
- local SQL=>KV requests for the first request by a fetcher:
  in this case the fetcher reserves a modest 1KB which can
  be significantly exceeded by KV allocations.

Only allocations in pebbleMVCCScanner for kv pairs and intents are
tracked. The memory is released when returning from
executeReadOnlyBatchWithServersideRefreshes since the chain
of returns will end up in gRPC response processing and we can't
hook into where that memory is released. This should still help
for some cases of OOMs, and give some signal of memory overload
that we can use elsewhere (e.g. admission control).

The BytesMonitor is used to construct a BoundAccount that is
that is passed via the EvalContext interface. The other
alternative would be for the engine to have a BytesMonitor at
initialization time that it can use to construct a BoundAccount
for each MVCC scan, and pass it back via MVCCScanResult. This
would mean multiple BoundAccounts for a batch (since we don't
want to release memory until all the requests in the batch are
processed), and would be harder to extend to track additional
request types compared to embedding in EvalContext.

The rootSQLMonitor is reused for this memory allocation tracking.
This tracking is always done for non-local requests, and for the
first request by a fetcher for a local request. This is to
avoid double-counting, the first request issued by a SQL fetcher
only reserves 1KB, but subsequent ones have already reserved
what was returned in the first response. So there is room to
tighten this if we knew what had been reserved on the local
client (there are complications because the batch may have
been split to send to different nodes, only one of which was
local).

The AdmissionHeader.SourceLocation field is used to mark local
requests and is set in rpc.internalClientAdapter. The first
request is marked using the
AdmissionHeader.NoMemoryReservedAtSource bit.

Informs cockroachdb#19721

Release note (ops change): The memory pool used for SQL is now
also used to cover KV memory used for scans.
sumeerbhola added a commit to sumeerbhola/cockroach that referenced this issue Aug 4, 2021
The goal is to track memory allocations for:
- non-local SQL=>KV requests: this can happen with joins,
  multi-tenant clusters, and if ranges move between DistSQL
  planning and execution.
- local SQL=>KV requests for the first request by a fetcher:
  in this case the fetcher reserves a modest 1KB which can
  be significantly exceeded by KV allocations.

Only allocations in pebbleMVCCScanner for kv pairs and intents are
tracked. The memory is released when returning from
executeReadOnlyBatchWithServersideRefreshes since the chain
of returns will end up in gRPC response processing and we can't
hook into where that memory is released. This should still help
for some cases of OOMs, and give some signal of memory overload
that we can use elsewhere (e.g. admission control).

The BytesMonitor is used to construct a BoundAccount that is
that is passed via the EvalContext interface. The other
alternative would be for the engine to have a BytesMonitor at
initialization time that it can use to construct a BoundAccount
for each MVCC scan, and pass it back via MVCCScanResult. This
would mean multiple BoundAccounts for a batch (since we don't
want to release memory until all the requests in the batch are
processed), and would be harder to extend to track additional
request types compared to embedding in EvalContext.

The rootSQLMonitor is reused for this memory allocation tracking.
This tracking is always done for non-local requests, and for the
first request by a fetcher for a local request. This is to
avoid double-counting, the first request issued by a SQL fetcher
only reserves 1KB, but subsequent ones have already reserved
what was returned in the first response. So there is room to
tighten this if we knew what had been reserved on the local
client (there are complications because the batch may have
been split to send to different nodes, only one of which was
local).

The AdmissionHeader.SourceLocation field is used to mark local
requests and is set in rpc.internalClientAdapter. The first
request is marked using the
AdmissionHeader.NoMemoryReservedAtSource bit.

Informs cockroachdb#19721

Release note (ops change): The memory pool used for SQL is now
also used to cover KV memory used for scans.
craig bot pushed a commit that referenced this issue Aug 4, 2021
66362: kv,storage: use a BytesMonitor to track memory allocations for scans r=sumeerbhola a=sumeerbhola

The goal is to track memory allocations for:
- non-local SQL=>KV requests: this can happen with joins,
  multi-tenant clusters, and if ranges move between DistSQL
  planning and execution.
- local SQL=>KV requests for the first request by a fetcher:
  in this case the fetcher reserves a modest 1KB which can
  be significantly exceeded by KV allocations.

Only allocations in pebbleMVCCScanner for kv pairs and intents are
tracked. The memory is released when returning from
executeReadOnlyBatchWithServersideRefreshes since the chain
of returns will end up in gRPC response processing and we can't
hook into where that memory is released. This should still help
for some cases of OOMs, and give some signal of memory overload
that we can use elsewhere (e.g. admission control).

The BytesMonitor is used to construct a BoundAccount that is
wrapped in a narrower ScannerMemoryMonitor that is passed via the
EvalContext interface. The other alternative would be for the
engine to have a BytesMonitor at initialization time that it
can use to construct a BoundAccount for each MVCC scan, and
pass it back via MVCCScanResult. This would mean multiple
BoundAccounts for a batch (since we don't want to release memory
until all the requests in the batch are processed), and would
be harder to extend to track additional request types compared
to embedding in EvalContext.

The rootSQLMonitor is reused for this memory allocation tracking.
This tracking is always done for non-local requests, and for the
first request by a fetcher for a local request. This is to
avoid double-counting, the first request issued by a SQL fetcher
only reserves 1KB, but subsequent ones have already reserved
what was returned in the first response. So there is room to
tighten this if we knew what had been reserved on the local
client (there are complications because the batch may have
been split to send to different nodes, only one of which was
local).

The AdmissionHeader.SourceLocation field is used to mark local
requests and is set in rpc.internalClientAdapter. The first
request is marked using the
AdmissionHeader.NoMemoryReservedAtSource bit.

Informs #19721

Release note (ops change): The memory pool used for SQL is now
also used to cover KV memory used for scans.

67451: colexec: optimize Bytes sorting with abbreviated values r=mgartner a=mgartner

#### colexec: add benchmark for sorting UUIDs

Release note: None

#### colexec: optimize Bytes sorting with abbreviated values

This commit introduces an optimization for sorting `Bytes` vectors.
While sorting, a signification potion of time is spent performing
comparisons between two datums. These comparisons are optimized by
creating an abbreviated `uint64` value for each `[]byte` that represents
up to the first eight bytes in the slice. The abbreviated value is used
for a comparison fast path that avoid comparing all bytes in a slice in
some cases. This technique is used by Postgres[1][2] and Pebble[3].

Abbreviating values to a `uint64` helps optimize comparisons for datums
that are larger than the size of the system's native pointer (64 bits).
Given datums `a` and `b`, the following properties must hold true for
the respective abbreviated values `abbrA` and `abbrB`:

  - `abbrA > abbrB` iff `a > b`
  - `abbrA < abbrB` iff `a < b`
  - If `abbrA == abbrB`, it is unknown if `a` is greater than, less
    than, or equal to `b`. A full comparison of `a` and `b` is required.

Comparing the abbreviated values first improves comparison performance
for two reasons. First, comparing two `uint64`s is fast. It is a single
CPU instruction. Any datum larger than 64 bits requires multiple
instructions for comparison. For example, comparing two `[]byte`s
requires iterating over each byte until non-equal bytes are found.
Second, CPU caches are more efficiently used because the abbreviated
values of a vector are packed into a `[]uint64` of contiguous memory.
This increases the chance that two datums can be compared by only
fetching information from CPU caches rather than main memory.

In the benchmarks below, `rows` is the number of rows being sorted,
`cols` is the number of sort columns, and `constAbbrPct` is the
percentage of rows for which the abbreviated values are the same. The
benchmarks show that in the best case, when the abbreviated values are
all unique, sort throughput has increased by up to ~45%. In the worst
case, where all values share the same abbreviated value, throughput has
decreased by as much as ~8.5%. This is due to the extra overhead of
creating abbreviated values and comparing them, only to have to fall
back to a full comparison every time.

    name                                             old speed      new speed      delta
    SortUUID/rows=2048/cols=1/constAbbrPct=0-16      38.0MB/s ± 2%  55.3MB/s ± 3%  +45.30%
    SortUUID/rows=2048/cols=1/constAbbrPct=50-16     36.2MB/s ± 3%  43.0MB/s ± 4%  +18.82%
    SortUUID/rows=2048/cols=1/constAbbrPct=75-16     36.0MB/s ± 4%  37.1MB/s ± 4%     ~
    SortUUID/rows=2048/cols=1/constAbbrPct=90-16     36.8MB/s ± 1%  36.2MB/s ± 3%     ~
    SortUUID/rows=2048/cols=1/constAbbrPct=100-16    37.0MB/s ± 1%  33.8MB/s ± 5%   -8.66%
    SortUUID/rows=2048/cols=2/constAbbrPct=0-16      60.3MB/s ± 1%  74.0MB/s ± 3%  +22.85%
    SortUUID/rows=2048/cols=2/constAbbrPct=50-16     58.7MB/s ± 1%  63.7MB/s ± 1%   +8.54%
    SortUUID/rows=2048/cols=2/constAbbrPct=75-16     58.3MB/s ± 1%  56.6MB/s ± 6%     ~
    SortUUID/rows=2048/cols=2/constAbbrPct=90-16     58.7MB/s ± 1%  54.2MB/s ± 3%   -7.69%
    SortUUID/rows=2048/cols=2/constAbbrPct=100-16    59.3MB/s ± 1%  56.1MB/s ± 4%   -5.30%
    SortUUID/rows=16384/cols=1/constAbbrPct=0-16     30.7MB/s ± 2%  41.9MB/s ± 3%  +36.24%
    SortUUID/rows=16384/cols=1/constAbbrPct=50-16    30.6MB/s ± 1%  32.0MB/s ± 8%     ~
    SortUUID/rows=16384/cols=1/constAbbrPct=75-16    30.2MB/s ± 2%  30.9MB/s ± 6%     ~
    SortUUID/rows=16384/cols=1/constAbbrPct=90-16    30.4MB/s ± 2%  27.6MB/s ± 3%   -9.29%
    SortUUID/rows=16384/cols=1/constAbbrPct=100-16   30.6MB/s ± 2%  28.5MB/s ± 5%   -6.98%
    SortUUID/rows=16384/cols=2/constAbbrPct=0-16     49.5MB/s ± 2%  59.0MB/s ± 2%  +19.13%
    SortUUID/rows=16384/cols=2/constAbbrPct=50-16    48.6MB/s ± 1%  49.5MB/s ± 3%   +1.84%
    SortUUID/rows=16384/cols=2/constAbbrPct=75-16    47.3MB/s ± 2%  47.3MB/s ± 2%     ~
    SortUUID/rows=16384/cols=2/constAbbrPct=90-16    48.5MB/s ± 2%  45.2MB/s ± 1%   -6.65%
    SortUUID/rows=16384/cols=2/constAbbrPct=100-16   48.9MB/s ± 1%  44.4MB/s ± 2%   -9.13%
    SortUUID/rows=262144/cols=1/constAbbrPct=0-16    25.4MB/s ± 2%  35.7MB/s ± 3%  +40.53%
    SortUUID/rows=262144/cols=1/constAbbrPct=50-16   24.9MB/s ± 3%  28.8MB/s ± 3%  +15.44%
    SortUUID/rows=262144/cols=1/constAbbrPct=75-16   25.4MB/s ± 2%  25.7MB/s ± 5%     ~
    SortUUID/rows=262144/cols=1/constAbbrPct=90-16   25.4MB/s ± 3%  24.7MB/s ± 2%   -3.00%
    SortUUID/rows=262144/cols=1/constAbbrPct=100-16  25.1MB/s ± 3%  23.7MB/s ± 2%   -5.73%
    SortUUID/rows=262144/cols=2/constAbbrPct=0-16    37.5MB/s ± 2%  43.7MB/s ± 5%  +16.65%
    SortUUID/rows=262144/cols=2/constAbbrPct=50-16   37.5MB/s ± 1%  37.8MB/s ± 7%     ~
    SortUUID/rows=262144/cols=2/constAbbrPct=75-16   37.1MB/s ± 4%  36.4MB/s ± 1%     ~
    SortUUID/rows=262144/cols=2/constAbbrPct=90-16   36.9MB/s ± 5%  34.6MB/s ± 7%   -6.26%
    SortUUID/rows=262144/cols=2/constAbbrPct=100-16  37.2MB/s ± 3%  34.0MB/s ± 2%   -8.53%

I experimented with several mitigations to the regressions, but did not
have much luck. Postgres uses HyperLogLog while building abbreviated
values to track their cardinality and abort the optimization if the
cardinality is too low. I attempted this, but the overhead of
HyperLogLog negated the benefits of the optimization entirely.

Using a `map[uint64]struct{}` to track the cardinality of abbreviated
values was the most promising. It was able to reduce the worst
regressions at the expense of reducing the speedup of the best case
benchmarks.

These mechanisms for aborting the optimization when abbreviated value
cardinality is low are essentially a trade-off between maximizing sort
performance when their cardinality is high and minimizing the overhead
of generating and comparing abbreviated values when their cardinality is
low. It's worth future investigation to try to find a low-overhead
method of aborting, and to be more thoughtful about the trade-offs.

[1] https://brandur.org/sortsupport
[2] http://pgeoghegan.blogspot.com/2015/01/abbreviated-keys-exploiting-locality-to.html
[3] https://github.com/cockroachdb/pebble/blob/ea60b4722cca21fa0d3c2749c788e1ac76981f7d/internal/base/comparer.go#L28-L36

Release note (performance improvement): Sort performance has been
improved when sorting columns of type STRING, BYTES, or UUID.


67606: sql: add sql shell telemetry counter r=rafiss,knz,arulajmani a=e-mbrown

Fixes #62208

This change allows us to keep track of
connections made using our internal SQl shell.

Release note: None

68427: jobs: skip TestRegistryLifecycle r=adityamaru a=jbowens

Refs: #68315

Reason: flaky test

Generated by bin/skip-test.

Release justification: non-production code changes

Release note: None

Co-authored-by: sumeerbhola <[email protected]>
Co-authored-by: Marcus Gartner <[email protected]>
Co-authored-by: e-mbrown <[email protected]>
Co-authored-by: Jackson Owens <[email protected]>
@github-actions
Copy link

We have marked this issue as stale because it has been inactive for
18 months. If this issue is still relevant, removing the stale label
or adding a comment will keep it active. Otherwise, we'll close it in
10 days to keep the issue queue tidy. Thank you for your contribution
to CockroachDB!

@github-actions github-actions bot closed this as not planned Won't fix, can't repro, duplicate, stale Oct 9, 2023
@knz
Copy link
Contributor

knz commented Oct 9, 2023

still elevant

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-kv-client Relating to the KV client and the KV interface. A-kv-observability A-kv-server Relating to the KV-level RPC server C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) T-kv KV Team
Projects
No open projects
Status: Incoming
Development

No branches or pull requests

7 participants