Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

roachtest: hotspotsplits/nodes=4 failed #33660

Closed
cockroach-teamcity opened this issue Jan 11, 2019 · 101 comments · Fixed by #45323
Closed

roachtest: hotspotsplits/nodes=4 failed #33660

cockroach-teamcity opened this issue Jan 11, 2019 · 101 comments · Fixed by #45323
Assignees
Labels
branch-master Failures and bugs on the master branch. C-test-failure Broken test (automatically or manually discovered). O-roachtest O-robot Originated from a bot.
Milestone

Comments

@cockroach-teamcity
Copy link
Member

SHA: https://github.com/cockroachdb/cockroach/commits/f5e3c29b2eed92868cf3d449575283e2e383f199

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=hotspotsplits/nodes=4 PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1088848&tab=buildLog

The test failed on master:
	test.go:696,hotspotsplits.go:94,hotspotsplits.go:110: dial tcp 35.193.104.30:26257: connect: connection refused

@cockroach-teamcity cockroach-teamcity added this to the 2.2 milestone Jan 11, 2019
@cockroach-teamcity cockroach-teamcity added C-test-failure Broken test (automatically or manually discovered). O-robot Originated from a bot. labels Jan 11, 2019
@tbg
Copy link
Member

tbg commented Jan 11, 2019

F190111 08:22:03.057452 18658 server/server_engine_health.go:54  [n1] disk stall detected: unable to write to <no-attributes>=/mnt/data1/cockroach within 10s
goroutine 18658 [running]:
github.com/cockroachdb/cockroach/pkg/util/log.getStacks(0xc4204d4200, 0xc4204d42a0, 0x4774e00, 0x1e)
	/go/src/github.com/cockroachdb/cockroach/pkg/util/log/clog.go:1005 +0xcf
github.com/cockroachdb/cockroach/pkg/util/log.(*loggingT).outputLogEntry(0x4eec480, 0xc400000004, 0x4774e62, 0x1e, 0x36, 0xc429281200, 0x5c)
	/go/src/github.com/cockroachdb/cockroach/pkg/util/log/clog.go:872 +0x939
github.com/cockroachdb/cockroach/pkg/util/log.addStructured(0x3601040, 0xc420a984e0, 0x4, 0x2, 0x0, 0x0, 0xc429e14f48, 0x1, 0x1)
	/go/src/github.com/cockroachdb/cockroach/pkg/util/log/structured.go:85 +0x2e5
github.com/cockroachdb/cockroach/pkg/util/log.logDepth(0x3601040, 0xc420a984e0, 0x1, 0x4, 0x0, 0x0, 0xc429e14f48, 0x1, 0x1)
	/go/src/github.com/cockroachdb/cockroach/pkg/util/log/log.go:71 +0x8c
github.com/cockroachdb/cockroach/pkg/util/log.Shout(0x3601040, 0xc420a984e0, 0x4, 0xc429e14f48, 0x1, 0x1)
	/go/src/github.com/cockroachdb/cockroach/pkg/util/log/log.go:89 +0xa4
github.com/cockroachdb/cockroach/pkg/server.guaranteedExitFatal(0x3601040, 0xc420a984e0, 0x307c7be, 0x34, 0xc420cde7b0, 0x2, 0x2)
	/go/src/github.com/cockroachdb/cockroach/pkg/server/server_engine_health.go:54 +0xe9
github.com/cockroachdb/cockroach/pkg/server.assertEngineHealth.func1.1()
	/go/src/github.com/cockroachdb/cockroach/pkg/server/server_engine_health.go:62 +0xd7
created by time.goFunc
	/usr/local/go/src/time/sleep.go:172 +0x44

on GCE. What gives? This has just started happening yesterday, but the stall detection has been in for longer.

Nightlies_Roachtest_Nightly_-_GCE_671_artifacts.zip

@tbg
Copy link
Member

tbg commented Jan 11, 2019

The mechanism seems to do what it's supposed to - the log is full of slow disk:

I190111 08:19:43.428848 202 server/status/runtime.go:464  [n1] runtime stats: 4.7 GiB RSS, 365 goroutines, 796 MiB/368 MiB/1.2 GiB GO alloc/idle/total, 3.7 GiB/4.2 GiB CGO alloc/total, 491.7 CGO/sec, 83.4/16.4 %(u/s)time, 0.0 %gc (3x), 177 MiB/70 MiB (r/w)net
W190111 08:19:43.758916 153 storage/store.go:3847  [n1,s1] handle raft ready: 0.8s [processed=0]
W190111 08:19:43.814798 155 storage/store.go:3847  [n1,s1] handle raft ready: 0.9s [processed=0]
W190111 08:19:43.911328 164 storage/store.go:3847  [n1,s1] handle raft ready: 0.9s [processed=0]
W190111 08:19:44.899871 146 storage/store.go:3847  [n1,s1] handle raft ready: 1.6s [processed=0]
W190111 08:19:45.698178 153 storage/store.go:3847  [n1,s1] handle raft ready: 1.9s [processed=0]
W190111 08:19:45.767701 158 storage/store.go:3847  [n1,s1] handle raft ready: 2.0s [processed=0]
W190111 08:19:45.855355 147 storage/store.go:3847  [n1,s1] handle raft ready: 2.0s [processed=0]
W190111 08:19:46.157423 167 storage/store.go:3847  [n1,s1] handle raft ready: 1.6s [processed=0]
W190111 08:19:46.239392 160 storage/store.go:3847  [n1,s1] handle raft ready: 1.7s [processed=0]
W190111 08:19:48.060893 157 storage/store.go:3847  [n1,s1] handle raft ready: 2.1s [processed=0]
...
W190111 08:21:01.269794 95 storage/closedts/provider/provider.go:144  [ct-closer] unable to move closed timestamp forward: not live
W190111 08:21:03.373006 204 server/node.go:886  [n1,summaries] health alerts detected: {Alerts:[{StoreID:1 Category:METRICS Description:queue.split.process.failure Value:1 XXX_NoUnkeyedLiteral:{} XXX_sizecache:0} {StoreID:0 Category:METRICS Description:liveness.heartbeatfailures Value:1 XXX_NoUnkeyedLiteral:{} XXX_sizecache:0}] XXX_NoUnkeyedLiteral:{} XXX_sizecache:0}
I190111 08:21:03.500499 202 server/status/runtime.go:464  [n1] runtime stats: 4.7 GiB RSS, 336 goroutines, 509 MiB/668 MiB/1.2 GiB GO alloc/idle/total, 3.7 GiB/4.2 GiB CGO alloc/total, 1198.5 CGO/sec, 162.3/38.4 %(u/s)time, 0.1 %gc (11x), 361 MiB/516 MiB (r/w)net
W190111 08:21:05.710860 209 storage/node_liveness.go:516  [n1,hb] slow heartbeat took 4.5s
I190111 08:21:05.710884 209 storage/node_liveness.go:449  [n1,hb] heartbeat failed on epoch increment; retrying
W190111 08:21:05.710950 18064 storage/node_liveness.go:516  [n1,s1,r184/1:/Table/53/1/-8{95298…-83876…}] slow heartbeat took 4.0s
E190111 08:21:05.710963 18064 storage/replica_range_lease.go:291  [n1,s1,r184/1:/Table/53/1/-8{95298…-83876…}] heartbeat failed on epoch increment
W190111 08:21:05.711074 18083 storage/node_liveness.go:516  [n1,s1,r116/1:/Table/53/1/1{578138…-651838…}] slow heartbeat took 4.0s
E190111 08:21:05.711105 18083 storage/replica_range_lease.go:291  [n1,s1,r116/1:/Table/53/1/1{578138…-651838…}] heartbeat failed on epoch increment
W190111 08:21:05.711166 18084 storage/node_liveness.go:516  [n1,s1,r159/1:/Table/53/1/-{434403…-317891…}] slow heartbeat took 4.0s
E190111 08:21:05.711177 18084 storage/replica_range_lease.go:291  [n1,s1,r159/1:/Table/53/1/-{434403…-317891…}] heartbeat failed on epoch increment
W190111 08:21:05.711264 18065 storage/node_liveness.go:516  [n1,s1,r151/1:/Table/53/1/-4{32642…-19050…}] slow heartbeat took 4.0s
E190111 08:21:05.711276 18065 storage/replica_range_lease.go:291  [n1,s1,r151/1:/Table/53/1/-4{32642…-19050…}] heartbeat failed on epoch increment
W190111 08:21:05.711333 18085 storage/node_liveness.go:516  [n1,s1,r175/1:/Table/53/1/2{284665…-396980…}] slow heartbeat took 4.0s
E190111 08:21:05.711344 18085 storage/replica_range_lease.go:291  [n1,s1,r175/1:/Table/53/1/2{284665…-396980…}] heartbeat failed on epoch increment
W190111 08:21:05.711411 18092 storage/node_liveness.go:516  [n1,s1,r172/1:/Table/53/1/2{669154…-788087…}] slow heartbeat took 4.0s
...

Interestingly I'm only seeing one instance of this:

> W190111 08:21:47.802956 153 storage/engine/rocksdb.go:1953  batch [8/710917/0] commit took 1.117851396s (>500ms):

and logging mostly stops around that time, until the crash a few sec later.

The latest runtime stats are

I190111 08:21:53.544518 202 server/status/runtime.go:464 [n1] runtime stats: 4.7 GiB RSS, 379 goroutines, 650 MiB/550 MiB/1.2 GiB GO alloc/idle/total, 3.6 GiB/4.1 GiB CGO alloc/total, 375.2 CGO/sec, 107.8/23.2 %(u/s)time, 0.0 %gc (1x), 5.6 MiB/75 MiB (r/w)net

That cgo number is low for a node that is supposedly running lots of splits.

@tbg
Copy link
Member

tbg commented Jan 11, 2019

@petermattis where do you think we should go from here? Can you check (or find someone to do so) whether we're catching write throttling or whether the GCE disk stats for these nodes shows any degradation?

@tbg
Copy link
Member

tbg commented Jan 11, 2019

Perhaps the disk stall test should also, for informative purposes, use a raw write to a temp file in the storage directory (distinguishing between RocksDB slowness and actual block device slowness).

@tbg
Copy link
Member

tbg commented Jan 11, 2019

For example, we could've run into Rocks write throttling.

@tbg tbg assigned petermattis and unassigned nvanbenschoten Jan 11, 2019
@petermattis
Copy link
Collaborator

We've only seen this failure once. Certainly could be some GCE badness, or it could be something else. I'll poke around and see if I notice anything you didn't.

@petermattis
Copy link
Collaborator

The logs contain lots of slow heartbeat and slow handle raft ready messages. But this occurs on all of the nodes and seems to be typical for this test.

@petermattis
Copy link
Collaborator

It's curious that most of the slow handle raft ready messages indicate that no Raft commands were processed:

W190111 08:21:34.025290 152 storage/store.go:3847  [n1,s1] handle raft ready: 0.8s [processed=0]
W190111 08:21:34.084771 156 storage/store.go:3847  [n1,s1] handle raft ready: 0.9s [processed=0]
W190111 08:21:34.166622 146 storage/store.go:3847  [n1,s1] handle raft ready: 0.7s [processed=0]
W190111 08:21:36.127867 171 storage/store.go:3847  [n1,s1] handle raft ready: 0.5s [processed=0]
W190111 08:21:37.009337 166 storage/store.go:3847  [n1,s1] handle raft ready: 0.5s [processed=0]
W190111 08:21:37.885984 157 storage/store.go:3847  [n1,s1] handle raft ready: 0.6s [processed=0]
W190111 08:21:38.115349 154 storage/store.go:3847  [n1,s1] handle raft ready: 0.6s [processed=0]

It's unfortunate that those log messages don't indicate the replica being processed. I'll fix that.

@petermattis
Copy link
Collaborator

Looks like the long handle Raft ready messages are due to RocksDB sync commits. I think. I'm adding more instrumentation to be sure.

@petermattis
Copy link
Collaborator

99%-tile log commit latencies around 1s. That seems surprisingly high.

@cockroach-teamcity
Copy link
Member Author

SHA: https://github.com/cockroachdb/cockroach/commits/395d842feb97c5bd8cad2b32b71a5156c03061eb

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=hotspotsplits/nodes=4 PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1115923&tab=buildLog

The test failed on master:
	test.go:743,hotspotsplits.go:94,hotspotsplits.go:110: dial tcp 35.224.251.205:26257: connect: connection refused

@tbg
Copy link
Member

tbg commented Jan 30, 2019

The last failure is fixed in #34399.

@cockroach-teamcity
Copy link
Member Author

SHA: https://github.com/cockroachdb/cockroach/commits/8179cd9efec890f1ba063488c7a502a96b8241dc

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=hotspotsplits/nodes=4 PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1119877&tab=buildLog

The test failed on release-2.1:
	test.go:743,test.go:755: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod create teamcity-1119877-hotspotsplits-nodes-4 -n 4 --gce-machine-type=n1-standard-4 --gce-zones=us-central1-b,us-west1-b,europe-west2-b --local-ssd-no-ext4-barrier returned:
		stderr:
		
		stdout:
		Creating cluster teamcity-1119877-hotspotsplits-nodes-4 with 4 nodes
		Unable to create cluster:
		in provider: gce: Command: gcloud [compute instances create --subnet default --maintenance-policy MIGRATE --scopes default,storage-rw --image ubuntu-1604-xenial-v20190122a --image-project ubuntu-os-cloud --boot-disk-size 10 --boot-disk-type pd-ssd --service-account [email protected] --local-ssd interface=SCSI --machine-type n1-standard-4 --labels lifetime=12h0m0s --metadata-from-file startup-script=/home/agent/temp/buildTmp/gce-startup-script014952889 --project cockroach-ephemeral]
		Output: ERROR: (gcloud.compute.instances.create) Could not fetch resource:
		 - The zone 'projects/cockroach-ephemeral/zones/us-central1-b' does not have enough resources available to fulfill the request.  '(resource type:pd-ssd)'.
		
		: exit status 1
		Cleaning up...
		: exit status 1

@cockroach-teamcity
Copy link
Member Author

SHA: https://github.com/cockroachdb/cockroach/commits/82026117d83262e87873aad52b8eca2dd0bea335

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=hotspotsplits/nodes=4 PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1126329&tab=buildLog

The test failed on provisional_201902052032_v2.2.0-alpha.20190211:
	test.go:743,test.go:755: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod create teamcity-1126329-hotspotsplits-nodes-4 -n 4 --gce-machine-type=n1-standard-4 --gce-zones=us-central1-b,us-west1-b,europe-west2-b --local-ssd-no-ext4-barrier returned:
		stderr:
		
		stdout:
		-subnet default --maintenance-policy MIGRATE --scopes default,storage-rw --image ubuntu-1604-xenial-v20190122a --image-project ubuntu-os-cloud --boot-disk-size 10 --boot-disk-type pd-ssd --service-account [email protected] --local-ssd interface=SCSI --machine-type n1-standard-4 --labels lifetime=12h0m0s --metadata-from-file startup-script=/home/agent/temp/buildTmp/gce-startup-script600134656 --project cockroach-ephemeral]
		Output: Created [https://www.googleapis.com/compute/v1/projects/cockroach-ephemeral/zones/us-central1-b/instances/teamcity-1126329-hotspotsplits-nodes-4-0003].
		Created [https://www.googleapis.com/compute/v1/projects/cockroach-ephemeral/zones/us-central1-b/instances/teamcity-1126329-hotspotsplits-nodes-4-0004].
		ERROR: (gcloud.compute.instances.create) Could not fetch resource:
		 - The zone 'projects/cockroach-ephemeral/zones/us-central1-b' does not have enough resources available to fulfill the request.  '(resource type:pd-ssd)'.
		
		: exit status 1
		Cleaning up...
		: exit status 1

@cockroach-teamcity
Copy link
Member Author

SHA: https://github.com/cockroachdb/cockroach/commits/57e825a7940495b67e0cc7213a5fabc24e12be0e

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=hotspotsplits/nodes=4 PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1176948&tab=buildLog

The test failed on master:
	hotspotsplits.go:94,hotspotsplits.go:110,test.go:1214: read tcp 172.17.0.2:48262->35.231.144.244:26257: read: connection reset by peer

@nvanbenschoten
Copy link
Member

Two different nodes hit OOMs a few minutes apart. Both have the majority of their memory (2.65 GB and 3.86 GB) allocated by github.com/cockroachdb/cockroach/pkg/storage/engine.copyFromSliceVector. Here's the goroutine stack of one of the callers of this function.

goroutine 13139 [running]:
runtime.systemstack_switch()
	/usr/local/go/src/runtime/asm_amd64.s:311 fp=0xc0167a97e0 sp=0xc0167a97d8 pc=0x758dd0
runtime.mallocgc(0x16806db, 0x0, 0x0, 0xc000000000)
	/usr/local/go/src/runtime/malloc.go:913 +0x896 fp=0xc0167a9880 sp=0xc0167a97e0 pc=0x70a6e6
github.com/cockroachdb/cockroach/pkg/storage/engine.nonZeroingMakeByteSlice(0x16806db, 0x2d00000005, 0x0, 0x0)
	/go/src/github.com/cockroachdb/cockroach/pkg/storage/engine/slice.go:20 +0x39 fp=0xc0167a98b0 sp=0xc0167a9880 pc=0x1349169
github.com/cockroachdb/cockroach/pkg/storage/engine.copyFromSliceVector(0x7f63ce3ef100, 0x5, 0x0, 0x0, 0xb)
	/go/src/github.com/cockroachdb/cockroach/pkg/storage/engine/rocksdb.go:2389 +0x8e fp=0xc0167a9950 sp=0xc0167a98b0 pc=0x135d4de
github.com/cockroachdb/cockroach/pkg/storage/engine.(*rocksDBIterator).MVCCScan(0xc005d3b068, 0xc029eb6040, 0xb, 0x10, 0xc008ceab40, 0xb, 0x10, 0x8ad, 0x158bc3deb0aa76e3, 0x0, ...)
	/go/src/github.com/cockroachdb/cockroach/pkg/storage/engine/rocksdb.go:2350 +0x328 fp=0xc0167a9b80 sp=0xc0167a9950 pc=0x135c908
github.com/cockroachdb/cockroach/pkg/storage/engine.MVCCScanToBytes(0x39db560, 0xc005e39b00, 0x39f7200, 0xc005d3aff0, 0xc029eb6040, 0xb, 0x10, 0xc008ceab40, 0xb, 0x10, ...)
	/go/src/github.com/cockroachdb/cockroach/pkg/storage/engine/mvcc.go:2042 +0x21a fp=0xc0167a9cb0 sp=0xc0167a9b80 pc=0x134251a
github.com/cockroachdb/cockroach/pkg/storage/batcheval.Scan(0x39db560, 0xc005e39b00, 0x3a15d20, 0xc005d3aff0, 0x3a22620, 0xc0066a0d80, 0x158bc3deb0aa76e3, 0x0, 0x200000002, 0x2, ...)
	/go/src/github.com/cockroachdb/cockroach/pkg/storage/batcheval/cmd_scan.go:49 +0x202 fp=0xc0167aa170 sp=0xc0167a9cb0 pc=0x1507482
github.com/cockroachdb/cockroach/pkg/storage.evaluateCommand(0x39db560, 0xc005e39b00, 0x0, 0x0, 0x0, 0x3a15d20, 0xc005d3aff0, 0x3a22620, 0xc0066a0d80, 0x0, ...)
	/go/src/github.com/cockroachdb/cockroach/pkg/storage/replica_evaluate.go:393 +0x280 fp=0xc0167aa868 sp=0xc0167aa170 pc=0x195e8b0
github.com/cockroachdb/cockroach/pkg/storage.evaluateBatch(0x39db560, 0xc005e39b00, 0x0, 0x0, 0x3a15d20, 0xc005d3aff0, 0x3a22620, 0xc0066a0d80, 0x0, 0x158bc3deb0aa76e3, ...)
	/go/src/github.com/cockroachdb/cockroach/pkg/storage/replica_evaluate.go:225 +0x31d fp=0xc0167ab318 sp=0xc0167aa868 pc=0x195d89d
github.com/cockroachdb/cockroach/pkg/storage.(*Replica).executeReadOnlyBatch(0xc0066a0d80, 0x39db560, 0xc005e39b00, 0x158bc3deb0aa76e3, 0x0, 0x200000002, 0x2, 0x17a, 0x0, 0xc0ed47b000, ...)
	/go/src/github.com/cockroachdb/cockroach/pkg/storage/replica_read.go:98 +0x622 fp=0xc0167ab9d0 sp=0xc0167ab318 pc=0x19894c2
github.com/cockroachdb/cockroach/pkg/storage.(*Replica).sendWithRangeID(0xc0066a0d80, 0x39db560, 0xc005e39b00, 0x17a, 0x158bc3deb0aa76e3, 0x0, 0x200000002, 0x2, 0x17a, 0x0, ...)
	/go/src/github.com/cockroachdb/cockroach/pkg/storage/replica.go:517 +0x297 fp=0xc0167abb48 sp=0xc0167ab9d0 pc=0x1947f47
github.com/cockroachdb/cockroach/pkg/storage.(*Replica).Send(0xc0066a0d80, 0x39db560, 0xc005e39ad0, 0x158bc3deb0aa76e3, 0x0, 0x200000002, 0x2, 0x17a, 0x0, 0xc0ed47b000, ...)
	/go/src/github.com/cockroachdb/cockroach/pkg/storage/replica.go:462 +0x83 fp=0xc0167abc08 sp=0xc0167abb48 pc=0x1947c73
github.com/cockroachdb/cockroach/pkg/storage.(*Store).Send(0xc000aa0000, 0x39db560, 0xc005e39ad0, 0x158bc3deb0aa76e3, 0x0, 0x200000002, 0x2, 0x17a, 0x0, 0xc0ed47b000, ...)
	/go/src/github.com/cockroachdb/cockroach/pkg/storage/store.go:2896 +0x70c fp=0xc0167ac140 sp=0xc0167abc08 pc=0x19a80ec
github.com/cockroachdb/cockroach/pkg/storage.(*Stores).Send(0xc0009c4460, 0x39db560, 0xc005e39a70, 0x0, 0x0, 0x200000002, 0x2, 0x17a, 0x0, 0xc0ed47b000, ...)
	/go/src/github.com/cockroachdb/cockroach/pkg/storage/stores.go:185 +0xcb fp=0xc0167ac230 sp=0xc0167ac140 pc=0x19c290b
github.com/cockroachdb/cockroach/pkg/server.(*Node).batchInternal.func1(0x39db560, 0xc005e39a70, 0x0, 0x0)
	/go/src/github.com/cockroachdb/cockroach/pkg/server/node.go:981 +0x1b0 fp=0xc0167ac340 sp=0xc0167ac230 pc=0x24b7480
github.com/cockroachdb/cockroach/pkg/util/stop.(*Stopper).RunTaskWithErr(0xc000b4a000, 0x39db560, 0xc005e39a70, 0x3283ee1, 0x10, 0xc0167ac3d0, 0x0, 0x0)
	/go/src/github.com/cockroachdb/cockroach/pkg/util/stop/stopper.go:304 +0xee fp=0xc0167ac378 sp=0xc0167ac340 pc=0x1084fce
github.com/cockroachdb/cockroach/pkg/server.(*Node).batchInternal(0xc000600480, 0x39db560, 0xc005e39a70, 0xc0da24b380, 0xc005e39a70, 0x31, 0x203001)
	/go/src/github.com/cockroachdb/cockroach/pkg/server/node.go:968 +0x16a fp=0xc0167ac400 sp=0xc0167ac378 pc=0x249098a
github.com/cockroachdb/cockroach/pkg/server.(*Node).Batch(0xc000600480, 0x39db560, 0xc005e39a70, 0xc0da24b380, 0xc1207a7110, 0xc0167ac4d0, 0x7e4f2c)
	/go/src/github.com/cockroachdb/cockroach/pkg/server/node.go:1009 +0x9c fp=0xc0167ac480 sp=0xc0167ac400 pc=0x2490a9c
github.com/cockroachdb/cockroach/pkg/rpc.internalClientAdapter.Batch(0x39aed20, 0xc000600480, 0x39db560, 0xc005e39a40, 0xc0da24b380, 0x0, 0x0, 0x0, 0xc0066f0000, 0x0, ...)
	/go/src/github.com/cockroachdb/cockroach/pkg/rpc/context.go:456 +0x4f fp=0xc0167ac4c8 sp=0xc0167ac480 pc=0x152284f
github.com/cockroachdb/cockroach/pkg/rpc.(*internalClientAdapter).Batch(0xc0003011d0, 0x39db560, 0xc005e39a40, 0xc0da24b380, 0x0, 0x0, 0x0, 0x158bc42a45c1fe01, 0x0, 0x0)
	<autogenerated>:1 +0x99 fp=0xc0167ac530 sp=0xc0167ac4c8 pc=0x152e2a9
github.com/cockroachdb/cockroach/pkg/kv.(*grpcTransport).sendBatch(0xc005e398c0, 0x39db560, 0xc005e39a40, 0x2, 0x39ba460, 0xc0003011d0, 0x0, 0x0, 0x200000002, 0x2, ...)
	/go/src/github.com/cockroachdb/cockroach/pkg/kv/transport.go:239 +0x126 fp=0xc0167ac5e8 sp=0xc0167ac530 pc=0x157c3c6
github.com/cockroachdb/cockroach/pkg/kv.(*grpcTransport).SendNext.func1()
	/go/src/github.com/cockroachdb/cockroach/pkg/kv/transport.go:207 +0xa1 fp=0xc0167ac6d0 sp=0xc0167ac5e8 pc=0x1590251
github.com/cockroachdb/cockroach/pkg/kv.withMarshalingDebugging(0x39db560, 0xc005e39a40, 0x0, 0x0, 0x200000002, 0x2, 0x17a, 0x0, 0xc0ed47b000, 0x0, ...)
	/go/src/github.com/cockroachdb/cockroach/pkg/kv/transport.go:188 +0xab fp=0xc0167ac708 sp=0xc0167ac6d0 pc=0x157bf5b
github.com/cockroachdb/cockroach/pkg/kv.(*grpcTransport).SendNext(0xc005e398c0, 0x39db560, 0xc0066f0000, 0x0, 0x0, 0x200000002, 0x2, 0x17a, 0x0, 0xc0ed47b000, ...)
	/go/src/github.com/cockroachdb/cockroach/pkg/kv/transport.go:206 +0x211 fp=0xc0167ac8e8 sp=0xc0167ac708 pc=0x157c1c1
github.com/cockroachdb/cockroach/pkg/kv.(*DistSender).sendToReplicas(0xc000b9c000, 0x39db560, 0xc0066f0000, 0xc000b9c050, 0x17a, 0xc008db2320, 0x3, 0x3, 0x0, 0x0, ...)
	/go/src/github.com/cockroachdb/cockroach/pkg/kv/dist_sender.go:1363 +0x2d3 fp=0xc0167acc90 sp=0xc0167ac8e8 pc=0x1572223
github.com/cockroachdb/cockroach/pkg/kv.(*DistSender).sendRPC(0xc000b9c000, 0x39db560, 0xc0066f0000, 0x17a, 0xc008db2320, 0x3, 0x3, 0x0, 0x0, 0x0, ...)
	/go/src/github.com/cockroachdb/cockroach/pkg/kv/dist_sender.go:416 +0x244 fp=0xc0167acdb0 sp=0xc0167acc90 pc=0x156c504
github.com/cockroachdb/cockroach/pkg/kv.(*DistSender).sendSingleRange(0xc000b9c000, 0x39db560, 0xc0066f0000, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0xc0ed47b000, ...)
	/go/src/github.com/cockroachdb/cockroach/pkg/kv/dist_sender.go:496 +0x221 fp=0xc0167acf30 sp=0xc0167acdb0 pc=0x156ca81
github.com/cockroachdb/cockroach/pkg/kv.(*DistSender).sendPartialBatch(0xc000b9c000, 0x39db560, 0xc0066f0000, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0xc0ed47b000, ...)
	/go/src/github.com/cockroachdb/cockroach/pkg/kv/dist_sender.go:1139 +0x322 fp=0xc0167ad4e0 sp=0xc0167acf30 pc=0x1570372
github.com/cockroachdb/cockroach/pkg/kv.(*DistSender).divideAndSendBatchToRanges(0xc000b9c000, 0x39db560, 0xc0066f0000, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0xc0ed47b000, ...)
	/go/src/github.com/cockroachdb/cockroach/pkg/kv/dist_sender.go:962 +0x8b3 fp=0xc0167ad8a8 sp=0xc0167ad4e0 pc=0x156f1a3
github.com/cockroachdb/cockroach/pkg/kv.(*DistSender).sendPartialBatch(0xc000b9c000, 0x39db560, 0xc0066f0000, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0xc014138500, ...)
	/go/src/github.com/cockroachdb/cockroach/pkg/kv/dist_sender.go:1205 +0x812 fp=0xc0167ade58 sp=0xc0167ad8a8 pc=0x1570862
github.com/cockroachdb/cockroach/pkg/kv.(*DistSender).divideAndSendBatchToRanges(0xc000b9c000, 0x39db560, 0xc0066f0000, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0xc014138500, ...)
	/go/src/github.com/cockroachdb/cockroach/pkg/kv/dist_sender.go:962 +0x8b3 fp=0xc0167ae220 sp=0xc0167ade58 pc=0x156f1a3
github.com/cockroachdb/cockroach/pkg/kv.(*DistSender).Send(0xc000b9c000, 0x39db560, 0xc0066f0000, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0xc00656d600, ...)
	/go/src/github.com/cockroachdb/cockroach/pkg/kv/dist_sender.go:710 +0x48b fp=0xc0167ae508 sp=0xc0167ae220 pc=0x156e03b
github.com/cockroachdb/cockroach/pkg/kv.(*txnLockGatekeeper).SendLocked(0xc1207ac410, 0x39db560, 0xc0066f0000, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0xc00656d600, ...)
	/go/src/github.com/cockroachdb/cockroach/pkg/kv/txn_coord_sender.go:234 +0xe8 fp=0xc0167ae5c0 sp=0xc0167ae508 pc=0x157d238
github.com/cockroachdb/cockroach/pkg/kv.(*txnSpanRefresher).sendLockedWithRefreshAttempts(0xc1207ac348, 0x39db560, 0xc0066f0000, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0xc00656d600, ...)
	/go/src/github.com/cockroachdb/cockroach/pkg/kv/txn_interceptor_span_refresher.go:160 +0x83 fp=0xc0167ae6a0 sp=0xc0167ae5c0 pc=0x158ac83
github.com/cockroachdb/cockroach/pkg/kv.(*txnSpanRefresher).SendLocked(0xc1207ac348, 0x39db560, 0xc0066f0000, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0xc00656d600, ...)
	/go/src/github.com/cockroachdb/cockroach/pkg/kv/txn_interceptor_span_refresher.go:101 +0xf9 fp=0xc0167ae790 sp=0xc0167ae6a0 pc=0x158a759
github.com/cockroachdb/cockroach/pkg/kv.(*txnPipeliner).SendLocked(0xc1207ac2c0, 0x39db560, 0xc0066f0000, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0xc00656d600, ...)
	/go/src/github.com/cockroachdb/cockroach/pkg/kv/txn_interceptor_pipeliner.go:182 +0xf9 fp=0xc0167ae8a8 sp=0xc0167ae790 pc=0x1588839
github.com/cockroachdb/cockroach/pkg/kv.(*txnSeqNumAllocator).SendLocked(0xc1207ac268, 0x39db560, 0xc0066f0000, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0xc00656d600, ...)
	/go/src/github.com/cockroachdb/cockroach/pkg/kv/txn_interceptor_seq_num_allocator.go:92 +0x23b fp=0xc0167ae9d8 sp=0xc0167ae8a8 pc=0x158a51b
github.com/cockroachdb/cockroach/pkg/kv.(*TxnCoordSender).Send(0xc1207ac000, 0x39db560, 0xc0066f0000, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
	/go/src/github.com/cockroachdb/cockroach/pkg/kv/txn_coord_sender.go:762 +0x591 fp=0xc0167aed28 sp=0xc0167ae9d8 pc=0x157ffe1
github.com/cockroachdb/cockroach/pkg/internal/client.(*DB).sendUsingSender(0xc000030100, 0x39db560, 0xc06124b020, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
	/go/src/github.com/cockroachdb/cockroach/pkg/internal/client/db.go:622 +0x119 fp=0xc0167aedf8 sp=0xc0167aed28 pc=0x116e569
github.com/cockroachdb/cockroach/pkg/internal/client.(*Txn).Send(0xc06129c1b0, 0x39db560, 0xc06124b020, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
	/go/src/github.com/cockroachdb/cockroach/pkg/internal/client/txn.go:791 +0x13c fp=0xc0167aef58 sp=0xc0167aedf8 pc=0x11787bc
github.com/cockroachdb/cockroach/pkg/sql/row.(*txnKVFetcher).fetch(0xc05b469790, 0x39db560, 0xc06124b020, 0x80004, 0x2)
	/go/src/github.com/cockroachdb/cockroach/pkg/sql/row/kv_batch_fetcher.go:242 +0x626 fp=0xc0167af458 sp=0xc0167aef58 pc=0x1c32396
github.com/cockroachdb/cockroach/pkg/sql/row.(*txnKVFetcher).nextBatch(0xc05b469790, 0x39db560, 0xc06124b020, 0x39db560, 0xc06124b020, 0xc0612ba1e0, 0xc1ab3d861e, 0xb, 0x8001e, 0xc1ab3d8633, ...)
	/go/src/github.com/cockroachdb/cockroach/pkg/sql/row/kv_batch_fetcher.go:326 +0x1dd fp=0xc0167af658 sp=0xc0167af458 pc=0x1c32e6d
github.com/cockroachdb/cockroach/pkg/sql/row.(*kvFetcher).nextKV(0xc1207ae4d8, 0x39db560, 0xc06124b020, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
	/go/src/github.com/cockroachdb/cockroach/pkg/sql/row/kv_fetcher.go:71 +0x2ef fp=0xc0167af850 sp=0xc0167af658 pc=0x1c3389f
github.com/cockroachdb/cockroach/pkg/sql/row.(*Fetcher).NextKey(0xc1207ae4a0, 0x39db560, 0xc06124b020, 0xc1ab3d861e, 0xb, 0x8001e)
	/go/src/github.com/cockroachdb/cockroach/pkg/sql/row/fetcher.go:490 +0x82 fp=0xc0167af950 sp=0xc0167af850 pc=0x1c244b2
github.com/cockroachdb/cockroach/pkg/sql/row.(*Fetcher).NextRow(0xc1207ae4a0, 0x39db560, 0xc06124b020, 0xc01bc5e000, 0x0, 0x96000, 0x3a150a0, 0xc00656d5e0, 0xc007285aa0, 0x1d98a4f)
	/go/src/github.com/cockroachdb/cockroach/pkg/sql/row/fetcher.go:990 +0x14a fp=0xc0167afa40 sp=0xc0167af950 pc=0x1c276aa
github.com/cockroachdb/cockroach/pkg/sql/distsqlrun.(*rowFetcherWrapper).Next(0xc1207a54c0, 0x8d9c5486fd165563, 0xc007285b08, 0x1d98b1e, 0xc007285b08)
	/go/src/github.com/cockroachdb/cockroach/pkg/sql/distsqlrun/tablereader.go:160 +0x4a fp=0xc0167afac8 sp=0xc0167afa40 pc=0x1df9aea
github.com/cockroachdb/cockroach/pkg/sql/distsqlrun.(*tableReader).Next(0xc1207ae000, 0xc061298a20, 0x2, 0x2, 0x2a0bf3d8b6fc4fa)
	/go/src/github.com/cockroachdb/cockroach/pkg/sql/distsqlrun/tablereader.go:277 +0x4c fp=0xc0167afb18 sp=0xc0167afac8 pc=0x1dfaa4c
github.com/cockroachdb/cockroach/pkg/sql/distsqlrun.(*samplerProcessor).mainLoop(0xc025613500, 0x39db4a0, 0xc1207a7100, 0xc00032c400, 0x0, 0x0)
	/go/src/github.com/cockroachdb/cockroach/pkg/sql/distsqlrun/sampler.go:179 +0x428 fp=0xc0167aff20 sp=0xc0167afb18 pc=0x1de0648
github.com/cockroachdb/cockroach/pkg/sql/distsqlrun.(*samplerProcessor).Run(0xc025613500, 0x39db4a0, 0xc1207a7100)
	/go/src/github.com/cockroachdb/cockroach/pkg/sql/distsqlrun/sampler.go:162 +0x104 fp=0xc0167affa0 sp=0xc0167aff20 pc=0x1de00d4
github.com/cockroachdb/cockroach/pkg/sql/distsqlrun.(*Flow).startInternal.func1(0xc061230fc0, 0xc0612ccad0, 0x0)
	/go/src/github.com/cockroachdb/cockroach/pkg/sql/distsqlrun/flow.go:564 +0x67 fp=0xc0167affc8 sp=0xc0167affa0 pc=0x1e08f07
runtime.goexit()
	/usr/local/go/src/runtime/asm_amd64.s:1333 +0x1 fp=0xc0167affd0 sp=0xc0167affc8 pc=0x75aeb1
created by github.com/cockroachdb/cockroach/pkg/sql/distsqlrun.(*Flow).startInternal
	/go/src/github.com/cockroachdb/cockroach/pkg/sql/distsqlrun/flow.go:563 +0x342

At the top of this stack, we can see a *samplerProcessor.

We see from the heap profile that the memory allocated by copyFromSliceVector is all in ~30MB chunks. This implies that the *samplerProcessor is scanning over entire ranges which are of a reasonable size and it rules out the idea that a single range is huge.

Screen Shot 2019-03-14 at 5 35 59 PM

@RaduBerinde or @rytaft any idea why *samplerProcessor (or something along its stack) would be holding on to so much memory?

@rytaft
Copy link
Collaborator

rytaft commented Mar 14, 2019

The samplerProcessor used to hold onto a lot of memory due to the problem described in #32614. Once that bug was fixed, it could still hold onto 16x too much memory since datums are allocated in slices of size 16, according to sqlbase.datumAllocSize. But this is only relevant if samples are collected for histograms, and we disabled histograms for automatic stats in 19.1.

It's possible that the extra memory usage is due to the hashes stored in the HyperLogLog sketches, but in that case I don't think the memory should show up as being allocated at the storage layer. Also, the memory used by the sketches should be pretty small unless this is a very wide table with a lot of indexes.

Are there multiple samplerProcessors running per node?

@RaduBerinde do you have any ideas?

@RaduBerinde
Copy link
Member

RaduBerinde commented Mar 14, 2019

The sampler is not holding on to any memory from the input rows. It only maintains the sketches which are updated with hashes of input data. The processors are fused so it's probably the fetcher inside the tablereader holding on to this memory. The only thing I can think of is what Becca pointed out, if we're running many of these processors in parallel (auto stats should only run one at a time so it would be a bug).

@nvanbenschoten
Copy link
Member

Are there multiple samplerProcessors running per node?

No, there is only one samplerProcessor per node.

The sampler is not holding on to any memory from the input rows. It only maintains the sketches which are updated with hashes of input data.

How about in the sampleAggregator? I also see this goroutine which may have been handed memory from the samplerProcessor:

goroutine 13117 [chan receive, 5 minutes]:
runtime.gopark(0x33df040, 0xc061298a18, 0xc00003170d, 0x3)
	/usr/local/go/src/runtime/proc.go:302 +0xeb fp=0xc0066dc700 sp=0xc0066dc6e0 pc=0x72c6db
runtime.goparkunlock(0xc061298a18, 0x170d, 0x3)
	/usr/local/go/src/runtime/proc.go:308 +0x53 fp=0xc0066dc730 sp=0xc0066dc700 pc=0x72c783
runtime.chanrecv(0xc0612989c0, 0xc061292808, 0x1, 0xc061292828)
	/usr/local/go/src/runtime/chan.go:520 +0x2be fp=0xc0066dc7c0 sp=0xc0066dc730 pc=0x703dee
runtime.chanrecv2(0xc0612989c0, 0xc061292808, 0x70a9c8)
	/usr/local/go/src/runtime/chan.go:407 +0x2b fp=0xc0066dc7f0 sp=0xc0066dc7c0 pc=0x703b1b
github.com/cockroachdb/cockroach/pkg/sql/distsqlrun.(*RowChannel).Next(0xc1207a7080, 0xc061231c00, 0xc1207a8990, 0x60643a4fdf98001, 0xc1207c0000)
	/go/src/github.com/cockroachdb/cockroach/pkg/sql/distsqlrun/base.go:464 +0x4a fp=0xc0066dc838 sp=0xc0066dc7f0 pc=0x1da442a
github.com/cockroachdb/cockroach/pkg/sql/distsqlrun.(*sampleAggregator).mainLoop(0xc090d0d500, 0x39db560, 0xc1207a8990, 0xc00032c420, 0x11, 0x39db560)
	/go/src/github.com/cockroachdb/cockroach/pkg/sql/distsqlrun/sample_aggregator.go:174 +0x1af fp=0xc0066dc9a8 sp=0xc0066dc838 pc=0x1dde04f
github.com/cockroachdb/cockroach/pkg/sql/distsqlrun.(*sampleAggregator).Run(0xc090d0d500, 0x39db560, 0xc1207a8990)
	/go/src/github.com/cockroachdb/cockroach/pkg/sql/distsqlrun/sample_aggregator.go:133 +0x104 fp=0xc0066dca28 sp=0xc0066dc9a8 pc=0x1dddd54
github.com/cockroachdb/cockroach/pkg/sql/distsqlrun.(*Flow).Run(0xc061230fc0, 0x39db560, 0xc1207a8990, 0x33d9b08, 0x0, 0x0)
	/go/src/github.com/cockroachdb/cockroach/pkg/sql/distsqlrun/flow.go:624 +0x1ec fp=0xc0066dca90 sp=0xc0066dca28 pc=0x1db3a0c
github.com/cockroachdb/cockroach/pkg/sql.(*DistSQLPlanner).Run(0xc00004ea00, 0xc061298720, 0xc06129c120, 0xc0612933d0, 0xc00a914580, 0xc0fdc5dbc0, 0x0)
	/go/src/github.com/cockroachdb/cockroach/pkg/sql/distsql_running.go:252 +0x8ad fp=0xc0066dd2d0 sp=0xc0066dca90 pc=0x200b4bd
github.com/cockroachdb/cockroach/pkg/sql.(*DistSQLPlanner).planAndRunCreateStats(0xc00004ea00, 0x39db560, 0xc0612ac570, 0xc0fdc5dbc0, 0xc061298720, 0xc06129c120, 0xc0611e4d20, 0xc06121f160, 0x0, 0x0)
	/go/src/github.com/cockroachdb/cockroach/pkg/sql/distsql_plan_stats.go:229 +0x2a5 fp=0xc0066dd580 sp=0xc0066dd2d0 pc=0x20089e5
github.com/cockroachdb/cockroach/pkg/sql.(*createStatsResumer).Resume.func2(0x39db560, 0xc0612647e0, 0xc06129c120, 0x0, 0x0)
	/go/src/github.com/cockroachdb/cockroach/pkg/sql/create_stats.go:325 +0x23f fp=0xc0066dd648 sp=0xc0066dd580 pc=0x212fddf
github.com/cockroachdb/cockroach/pkg/internal/client.(*DB).Txn.func1(0x39db560, 0xc0612647e0, 0xc06129c120, 0x0, 0x0)
	/go/src/github.com/cockroachdb/cockroach/pkg/internal/client/db.go:585 +0x43 fp=0xc0066dd680 sp=0xc0066dd648 pc=0x117b6a3
github.com/cockroachdb/cockroach/pkg/internal/client.(*Txn).exec(0xc06129c120, 0x39db560, 0xc0612647e0, 0xc0611e9490, 0x1, 0xc06129c120)
	/go/src/github.com/cockroachdb/cockroach/pkg/internal/client/txn.go:688 +0xd9 fp=0xc0066dd800 sp=0xc0066dd680 pc=0x1177fd9
github.com/cockroachdb/cockroach/pkg/internal/client.(*DB).Txn(0xc000030380, 0x39db560, 0xc0612647e0, 0xc0612b8140, 0xc0612647e0, 0x0)
	/go/src/github.com/cockroachdb/cockroach/pkg/internal/client/db.go:584 +0xe3 fp=0xc0066dd870 sp=0xc0066dd800 pc=0x116e273
github.com/cockroachdb/cockroach/pkg/sql.(*createStatsResumer).Resume(0xc1003134f8, 0x39db560, 0xc0612647e0, 0xc0611e4d20, 0x3263ee0, 0xc0fdc5db00, 0xc148465da0, 0x0, 0x0)
	/go/src/github.com/cockroachdb/cockroach/pkg/sql/create_stats.go:316 +0x2c0 fp=0xc0066ddcb8 sp=0xc0066dd870 pc=0x1fc96f0
github.com/cockroachdb/cockroach/pkg/jobs.(*Registry).resume.func1(0x39db560, 0xc0612647e0)
	/go/src/github.com/cockroachdb/cockroach/pkg/jobs/registry.go:543 +0x355 fp=0xc0066ddf68 sp=0xc0066ddcb8 pc=0x1a47b75
github.com/cockroachdb/cockroach/pkg/util/stop.(*Stopper).RunAsyncTask.func1(0xc000b4a000, 0x39db4a0, 0xc0611e2d40, 0xc081496e60, 0x1e, 0x0, 0x0, 0xc079b57090)
	/go/src/github.com/cockroachdb/cockroach/pkg/util/stop/stopper.go:325 +0xe6 fp=0xc0066ddfa0 sp=0xc0066ddf68 pc=0x1086b56
runtime.goexit()
	/usr/local/go/src/runtime/asm_amd64.s:1333 +0x1 fp=0xc0066ddfa8 sp=0xc0066ddfa0 pc=0x75aeb1
created by github.com/cockroachdb/cockroach/pkg/util/stop.(*Stopper).RunAsyncTask
	/go/src/github.com/cockroachdb/cockroach/pkg/util/stop/stopper.go:320 +0x134

Also, FWIW we see transaction interceptors like txnMetricRecorder missing from the previous stacktrace, which indicates that this is a "Leaf" TxnCoordSender. I believe that means that this is a remote DistSQL flow.

@RaduBerinde
Copy link
Member

The sampler only sends sketch data to the sampleAggregator.

Note that the sampler is throttled (sleeps a lot) so it might be holding onto a given KV batch for much longer than a regular scan. But it shouldn't be holding on to more than one batch.

@cockroach-teamcity
Copy link
Member Author

SHA: https://github.com/cockroachdb/cockroach/commits/b5768aecd39461ab9a54e2e7db059a3fe8b00459

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=hotspotsplits/nodes=4 PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1191957&tab=buildLog

The test failed on release-19.1:
	hotspotsplits.go:94,hotspotsplits.go:110,test.go:1214: read tcp 172.17.0.2:42048->35.231.237.27:26257: read: connection reset by peer

@cockroach-teamcity
Copy link
Member Author

SHA: https://github.com/cockroachdb/cockroach/commits/bb50fd396f6ce79258b744e1f8efa2d1bc9dfbd2

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=hotspotsplits/nodes=4 PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1272551&tab=buildLog

The test failed on branch=release-19.1, cloud=gce:
	cluster.go:1400,hotspotsplits.go:39,hotspotsplits.go:110,test.go:1251: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod start teamcity-1272551-hotspotsplits-nodes-4:1-3 returned:
		stderr:
		
		stdout:
		teamcity-1272551-hotspotsplits-nodes-4: starting..........................................................................................................................
		1: exit status 255
		~ ./cockroach version
		Connection to 35.227.17.104 closed by remote host.
		
		github.com/cockroachdb/cockroach/pkg/cmd/roachprod/install.getCockroachVersion
			/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachprod/install/cockroach.go:95
		github.com/cockroachdb/cockroach/pkg/cmd/roachprod/install.Cockroach.Start.func7
			/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachprod/install/cockroach.go:289
		github.com/cockroachdb/cockroach/pkg/cmd/roachprod/install.(*SyncedCluster).Parallel.func1.1
			/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachprod/install/cluster_synced.go:1447
		runtime.goexit
			/usr/local/go/src/runtime/asm_amd64.s:1333: 
		I190503 07:49:36.791655 1 cluster_synced.go:1529  command failed
		: exit status 1

@cockroach-teamcity
Copy link
Member Author

(roachtest).hotspotsplits/nodes=4 failed on master@349d2c0fff138c2a3f452b11b900924d6dc8b445:

The test failed on branch=master, cloud=gce:
test artifacts and logs in: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/artifacts/hotspotsplits/nodes=4/run_1
	hotspotsplits.go:85,hotspotsplits.go:102,test_runner.go:741: read tcp 172.17.0.2:59206->35.223.250.113:26257: read: connection reset by peer

	cluster.go:1410,context.go:135,cluster.go:1399,test_runner.go:778: dead node detection: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod monitor teamcity-1729787-1580889572-10-n4cpu4 --oneshot --ignore-empty-nodes: exit status 1 4: skipped
		2: 4219
		1: dead
		3: 4074
		Error:  1: dead

More

Artifacts: /hotspotsplits/nodes=4
Related:

See this test on roachdash
powered by pkg/cmd/internal/issues

@cockroach-teamcity
Copy link
Member Author

(roachtest).hotspotsplits/nodes=4 failed on master@08bb94d98b7f60126666e9c5a317afca61422d7f:

The test failed on branch=master, cloud=gce:
test artifacts and logs in: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/artifacts/hotspotsplits/nodes=4/run_1
	cluster.go:1410,context.go:135,cluster.go:1399,test_runner.go:778: dead node detection: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod monitor teamcity-1737376-1581149318-03-n4cpu4 --oneshot --ignore-empty-nodes: exit status 1 4: skipped
		3: dead
		1: 4412
		2: 3900
		Error:  3: dead

More

Artifacts: /hotspotsplits/nodes=4
Related:

See this test on roachdash
powered by pkg/cmd/internal/issues

@cockroach-teamcity
Copy link
Member Author

(roachtest).hotspotsplits/nodes=4 failed on master@c4cf5585cbda55caadb1782f6705e52a83659538:

The test failed on branch=master, cloud=gce:
test artifacts and logs in: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/artifacts/hotspotsplits/nodes=4/run_1
	cluster.go:1410,context.go:135,cluster.go:1399,test_runner.go:778: dead node detection: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod monitor teamcity-1739656-1581235400-05-n4cpu4 --oneshot --ignore-empty-nodes: exit status 1 4: skipped
		3: dead
		2: 3836
		1: 4129
		Error:  3: dead

More

Artifacts: /hotspotsplits/nodes=4
Related:

See this test on roachdash
powered by pkg/cmd/internal/issues

@cockroach-teamcity
Copy link
Member Author

(roachtest).hotspotsplits/nodes=4 failed on master@fde991a0f5aba7ffd66112295996ad26f3725036:

The test failed on branch=master, cloud=gce:
test artifacts and logs in: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/artifacts/hotspotsplits/nodes=4/run_1
	hotspotsplits.go:85,hotspotsplits.go:102,test_runner.go:741: range size 232 MiB exceeded 192 MiB
		main.registerHotSpotSplits.func1.2
			/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/hotspotsplits.go:69
		github.com/cockroachdb/cockroach/vendor/golang.org/x/sync/errgroup.(*Group).Go.func1
			/home/agent/work/.go/src/github.com/cockroachdb/cockroach/vendor/golang.org/x/sync/errgroup/errgroup.go:57
		runtime.goexit
			/usr/local/go/src/runtime/asm_amd64.s:1357

More

Artifacts: /hotspotsplits/nodes=4
Related:

See this test on roachdash
powered by pkg/cmd/internal/issues

@cockroach-teamcity
Copy link
Member Author

(roachtest).hotspotsplits/nodes=4 failed on master@b797cad6d130714748983bc53d4611ddc6151153:

The test failed on branch=master, cloud=gce:
test artifacts and logs in: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/artifacts/hotspotsplits/nodes=4/run_1
	hotspotsplits.go:85,hotspotsplits.go:102,test_runner.go:741: range size 252 MiB exceeded 192 MiB
		main.registerHotSpotSplits.func1.2
			/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/hotspotsplits.go:69
		github.com/cockroachdb/cockroach/vendor/golang.org/x/sync/errgroup.(*Group).Go.func1
			/home/agent/work/.go/src/github.com/cockroachdb/cockroach/vendor/golang.org/x/sync/errgroup/errgroup.go:57
		runtime.goexit
			/usr/local/go/src/runtime/asm_amd64.s:1357

More

Artifacts: /hotspotsplits/nodes=4
Related:

See this test on roachdash
powered by pkg/cmd/internal/issues

@cockroach-teamcity
Copy link
Member Author

(roachtest).hotspotsplits/nodes=4 failed on master@a51ddb8342ff54283baf5c7556cb1d4ff8c4e5da:

The test failed on branch=master, cloud=gce:
test artifacts and logs in: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/artifacts/hotspotsplits/nodes=4/run_1
	hotspotsplits.go:85,hotspotsplits.go:102,test_runner.go:741: range size 561 MiB exceeded 192 MiB
		main.registerHotSpotSplits.func1.2
			/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/hotspotsplits.go:69
		github.com/cockroachdb/cockroach/vendor/golang.org/x/sync/errgroup.(*Group).Go.func1
			/home/agent/work/.go/src/github.com/cockroachdb/cockroach/vendor/golang.org/x/sync/errgroup/errgroup.go:57
		runtime.goexit
			/usr/local/go/src/runtime/asm_amd64.s:1357

More

Artifacts: /hotspotsplits/nodes=4
Related:

See this test on roachdash
powered by pkg/cmd/internal/issues

@cockroach-teamcity
Copy link
Member Author

(roachtest).hotspotsplits/nodes=4 failed on master@49e2b18471a42aed8a8ad7cb658863f045ea1aac:

The test failed on branch=master, cloud=gce:
test artifacts and logs in: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/artifacts/hotspotsplits/nodes=4/run_1
	hotspotsplits.go:85,hotspotsplits.go:102,test_runner.go:741: range size 490 MiB exceeded 192 MiB
		main.registerHotSpotSplits.func1.2
			/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/hotspotsplits.go:69
		github.com/cockroachdb/cockroach/vendor/golang.org/x/sync/errgroup.(*Group).Go.func1
			/home/agent/work/.go/src/github.com/cockroachdb/cockroach/vendor/golang.org/x/sync/errgroup/errgroup.go:57
		runtime.goexit
			/usr/local/go/src/runtime/asm_amd64.s:1357

More

Artifacts: /hotspotsplits/nodes=4
Related:

See this test on roachdash
powered by pkg/cmd/internal/issues

@tbg
Copy link
Member

tbg commented Feb 24, 2020

@ajwerner is this you? hotspotsplits.go:85,hotspotsplits.go:102,test_runner.go:741: range size 490 MiB exceeded 192 MiB

tbg added a commit to tbg/cockroach that referenced this issue Feb 24, 2020
Fixes cockroachdb#33660.

Release note (bug fix): Significantly reduce the amount of memory
allocated while scanning tables with a large average row size.
@tbg
Copy link
Member

tbg commented Feb 24, 2020

Ah yeah obviously 660b3e7. I'll fix, stressing this test right now anyway

tbg added a commit to tbg/cockroach that referenced this issue Feb 24, 2020
660b3e7 bumped the default range size
by a factor of 8. Do the same in this test.

Addresses one failure mode of cockroachdb#33660.

Release note: None
@cockroach-teamcity
Copy link
Member Author

(roachtest).hotspotsplits/nodes=4 failed on master@abe944fb0ad990ceaba7e592cc83b8922639016c:

The test failed on branch=master, cloud=gce:
test artifacts and logs in: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/artifacts/hotspotsplits/nodes=4/run_1
	hotspotsplits.go:85,hotspotsplits.go:102,test_runner.go:741: range size 529 MiB exceeded 192 MiB
		main.registerHotSpotSplits.func1.2
			/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/hotspotsplits.go:69
		github.com/cockroachdb/cockroach/vendor/golang.org/x/sync/errgroup.(*Group).Go.func1
			/home/agent/work/.go/src/github.com/cockroachdb/cockroach/vendor/golang.org/x/sync/errgroup/errgroup.go:57
		runtime.goexit
			/usr/local/go/src/runtime/asm_amd64.s:1357

More

Artifacts: /hotspotsplits/nodes=4
Related:

See this test on roachdash
powered by pkg/cmd/internal/issues

@cockroach-teamcity
Copy link
Member Author

(roachtest).hotspotsplits/nodes=4 failed on master@8b5adba703fae9b6961623f65b685d93b0fe0290:

The test failed on branch=master, cloud=gce:
test artifacts and logs in: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/artifacts/hotspotsplits/nodes=4/run_1
	hotspotsplits.go:85,hotspotsplits.go:102,test_runner.go:741: range size 556 MiB exceeded 192 MiB
		main.registerHotSpotSplits.func1.2
			/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/hotspotsplits.go:69
		github.com/cockroachdb/cockroach/vendor/golang.org/x/sync/errgroup.(*Group).Go.func1
			/home/agent/work/.go/src/github.com/cockroachdb/cockroach/vendor/golang.org/x/sync/errgroup/errgroup.go:57
		runtime.goexit
			/usr/local/go/src/runtime/asm_amd64.s:1357

More

Artifacts: /hotspotsplits/nodes=4
Related:

See this test on roachdash
powered by pkg/cmd/internal/issues

@cockroach-teamcity
Copy link
Member Author

(roachtest).hotspotsplits/nodes=4 failed on master@317061202f4510e8087ff6ba9e2e5cb2fce0bb70:

The test failed on branch=master, cloud=gce:
test artifacts and logs in: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/artifacts/hotspotsplits/nodes=4/run_1
	hotspotsplits.go:85,hotspotsplits.go:102,test_runner.go:741: range size 238 MiB exceeded 192 MiB
		main.registerHotSpotSplits.func1.2
			/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/hotspotsplits.go:69
		github.com/cockroachdb/cockroach/vendor/golang.org/x/sync/errgroup.(*Group).Go.func1
			/home/agent/work/.go/src/github.com/cockroachdb/cockroach/vendor/golang.org/x/sync/errgroup/errgroup.go:57
		runtime.goexit
			/usr/local/go/src/runtime/asm_amd64.s:1357

More

Artifacts: /hotspotsplits/nodes=4
Related:

See this test on roachdash
powered by pkg/cmd/internal/issues

@ajwerner
Copy link
Contributor

I’ll take the more recent failures here that are related to the range size changes.

@ajwerner ajwerner self-assigned this Feb 27, 2020
@cockroach-teamcity
Copy link
Member Author

(roachtest).hotspotsplits/nodes=4 failed on master@fc5c7f093bf1e86852c3b839bc0f6710d9902729:

The test failed on branch=master, cloud=gce:
test artifacts and logs in: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/artifacts/hotspotsplits/nodes=4/run_1
	hotspotsplits.go:85,hotspotsplits.go:102,test_runner.go:741: range size 258 MiB exceeded 192 MiB
		main.registerHotSpotSplits.func1.2
			/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/hotspotsplits.go:69
		github.com/cockroachdb/cockroach/vendor/golang.org/x/sync/errgroup.(*Group).Go.func1
			/home/agent/work/.go/src/github.com/cockroachdb/cockroach/vendor/golang.org/x/sync/errgroup/errgroup.go:57
		runtime.goexit
			/usr/local/go/src/runtime/asm_amd64.s:1357

More

Artifacts: /hotspotsplits/nodes=4
Related:

See this test on roachdash
powered by pkg/cmd/internal/issues

tbg added a commit to tbg/cockroach that referenced this issue Feb 28, 2020
660b3e7 bumped the default range size
by a factor of 8. Do the same in this test.

Addresses one failure mode of cockroachdb#33660.

Release note: None
craig bot pushed a commit that referenced this issue Feb 28, 2020
45323: row: set TargetBytes for kvfetcher r=nvanbenschoten a=tbg

This finishes up the work in #44925 and completes the TargetBytes functionality. This is then used in kvfetcher, which henceforth aims to return no more than ~1mb per request. Additional commits restore the hotspotsplits roachtest and fix it. Reverting the relevant commit from this PR, the test failed nine out of ten times). With all commits, it passed ten times.

The one question I have whether TargetBytes should be set to a value higher than 1mb to avoid a performance regression (where should I look?).

Fixes #33660.

Co-authored-by: Tobias Schottdorf <[email protected]>
@craig craig bot closed this as completed in ef8ec0f Feb 28, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
branch-master Failures and bugs on the master branch. C-test-failure Broken test (automatically or manually discovered). O-roachtest O-robot Originated from a bot.
Projects
None yet
9 participants