Merge #45551 #45567 #45568

45551: ui: standardize the display of node names in node lists r=dhartunian a=petermattis Standardize the display of node names in node lists. The nodes overview list was displaying nodes as `N<id> <ip-address>` while graphs display `<ip-address> (n<id>)`. Standardize on the latter format. A similar problem existed on the statement details page which was displaying `N<id> <ip-address> (n<id>)`. The node id was being displayed twice! Release note: None 45567: storage/concurrency: push reservation holders to detect deadlocks r=nvanbenschoten a=nvanbenschoten This is a partial reversion of #45420. It turns out that there are cases where a reservation holder is a link in a dependency cycle. This can happen when a reservation holder txn is holding on to one reservation while blocking on a lock on another key. If any txns queued up behind the reservation did not push _someone_, they wouldn't detect the deadlock. ``` range A . range B . txnB . txnC txnA | . | ^________ | v . v \ v [lock X: (txnA)] . [lock Y: (txnB)] [res Z: (txnC)] ``` It turns out that this segment of the dependency cycle is always local to a single concurrency manager, so it could potentially forward the push through the reservation links to shorten the cycle and prevent non-lock holders from ever being the victim of a deadlock abort. This is tricky though, so for now, we just push. To address the issue that motivated #45420, we perform this form of push asynchronously while continuing to listen to state transitions in the lockTable. If the pusher is unblocked (see #45420 for an example of when that can happen), it simply cancels the push and proceeds with navigating the lockTable. This PR also adds a set of datadriven deadlock tests to the concurrency manager test suite: [`testdata/concurrency_manager/deadlocks`](https://github.com/cockroachdb/cockroach/pull/45567/files#diff-5934754d5a8f1086698cdbab628ee1b5). These tests construct deadlocks due to lock ordering and request ordering and demonstrate how the deadlocks are resolved. 45568: roachtest: enable txn heartbeat loops for 1PC txns in kv/contention/nodes=4 r=nvanbenschoten a=nvanbenschoten I noticed when debugging issues in #45482 that unhandled deadlocks occasionally resolved themselves because txns would eventually time out. This was because we don't start the txn heartbeat loop for 1PC txns. In this kind of test, we want any unhandled deadlocks to be as loud as possible, so just like we set a very long txn expiration, we also enable the txn heartbeat loop for all txns, even those that we expect will be 1PC. This commit also drops the kv.lock_table.deadlock_detection_push_delay down for the test, since it's already touching this code. Co-authored-by: Peter Mattis <[email protected]> Co-authored-by: Nathan VanBenschoten <[email protected]>
cockroachdb · Mar 3, 2020 · 5f9a71a · 5f9a71a
4 parents 06989f3 + 5e31f85 + 51178dc + 898383f
commit 5f9a71a
Show file tree

Hide file tree

Showing 13 changed files with 1,129 additions and 257 deletions.
diff --git a/pkg/cmd/roachtest/kv.go b/pkg/cmd/roachtest/kv.go
@@ -202,13 +202,31 @@ func registerKVContention(r *testRegistry) {
 			// If requests ever get stuck on a transaction that was abandoned
 			// then it will take 2m for them to get unstuck, at which point the
 			// QPS threshold check in the test is likely to fail.
-			args := startArgs("--env=COCKROACH_TXN_LIVENESS_HEARTBEAT_MULTIPLIER=120")
+			//
+			// Additionally, ensure that even transactions that issue a 1PC
+			// batch begin heartbeating. This ensures that if they end up in
+			// part of a dependency cycle, they can never be expire without
+			// being actively aborted.
+			args := startArgs(
+				"--env=COCKROACH_TXN_LIVENESS_HEARTBEAT_MULTIPLIER=120 COCKROACH_TXN_HEARTBEAT_DURING_1PC=true",
+			)
 			c.Start(ctx, t, args, c.Range(1, nodes))
 
+			conn := c.Conn(ctx, 1)
 			// Enable request tracing, which is a good tool for understanding
 			// how different transactions are interacting.
-			c.Run(ctx, c.Node(1),
-				`./cockroach sql --insecure -e "SET CLUSTER SETTING trace.debug.enable = true"`)
+			if _, err := conn.Exec(`
+				SET CLUSTER SETTING trace.debug.enable = true;
+			`); err != nil {
+				t.Fatal(err)
+			}
+			// Drop the deadlock detection delay because the test creates a
+			// large number transaction deadlocks.
+			if _, err := conn.Exec(`
+				SET CLUSTER SETTING kv.lock_table.deadlock_detection_push_delay = '5ms'
+			`); err != nil && !strings.Contains(err.Error(), "unknown cluster setting") {
+				t.Fatal(err)
+			}
 
 			t.Status("running workload")
 			m := newMonitor(ctx, c, c.Range(1, nodes))

diff --git a/pkg/kv/txn_interceptor_heartbeater.go b/pkg/kv/txn_interceptor_heartbeater.go
@@ -16,12 +16,20 @@ import (
 	"time"
 
 	"github.com/cockroachdb/cockroach/pkg/roachpb"
+	"github.com/cockroachdb/cockroach/pkg/util/envutil"
 	"github.com/cockroachdb/cockroach/pkg/util/hlc"
 	"github.com/cockroachdb/cockroach/pkg/util/log"
 	"github.com/cockroachdb/cockroach/pkg/util/stop"
 	opentracing "github.com/opentracing/opentracing-go"
 )
 
+// txnHeartbeatDuring1PC defines whether the txnHeartbeater should launch a
+// heartbeat loop for 1PC transactions. The value defaults to false even though
+// 1PC transactions leave intents around on retriable errors if the batch has
+// been split between ranges and may be pushed when in lock wait-queues because
+// we consider that unlikely enough so we prefer to not pay for a goroutine.
+var txnHeartbeatFor1PC = envutil.EnvOrDefaultBool("COCKROACH_TXN_HEARTBEAT_DURING_1PC", false)
+
 // txnHeartbeater is a txnInterceptor in charge of a transaction's heartbeat
 // loop. Transaction coordinators heartbeat their transaction record
 // periodically to indicate the liveness of their transaction. Other actors like
@@ -151,12 +159,9 @@ func (h *txnHeartbeater) SendLocked(
 		}
 
 		// Start the heartbeat loop if it has not already started.
-		//
-		// Note that we don't do it for 1PC txns: they only leave intents around on
-		// retriable errors if the batch has been split between ranges. We consider
-		// that unlikely enough so we prefer to not pay for a goroutine.
 		if !h.mu.loopStarted {
-			if _, haveEndTxn := ba.GetArg(roachpb.EndTxn); !haveEndTxn {
+			_, haveEndTxn := ba.GetArg(roachpb.EndTxn)
+			if !haveEndTxn || txnHeartbeatFor1PC {
 				if err := h.startHeartbeatLoopLocked(ctx); err != nil {
 					return nil, roachpb.NewError(err)
 				}

diff --git a/pkg/storage/concurrency/concurrency_control.go b/pkg/storage/concurrency/concurrency_control.go
@@ -561,7 +561,7 @@ type lockTableGuard interface {
 	// have an initial notification. Note that notifications are collapsed if
 	// not retrieved, since it is not necessary for the waiter to see every
 	// state transition.
-	NewStateChan() <-chan struct{}
+	NewStateChan() chan struct{}
 
 	// CurState returns the latest waiting state.
 	CurState() waitingState