prometheusremotewrite: release lock when wal empty #20875
Conversation
Previously, the WAL tailing routine held an exclusive lock over the entire WAL while waiting for new sections to be written. This led to a deadlock situation, as a locked WAL naturally cannot be written to. To remedy this, the lock is now only held during actual read attempts, not in between. Furthermore, inotify watching of the WAL directory has been replaced with channel-based signaling between the producers and the tailing routine, so that the latter blocks until any new writes have happened.
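To make the fix concrete, below is a minimal, self-contained sketch of the pattern described above. All names in it (walSketch, persist, tryRead, tail, notify) are illustrative stand-ins and not the exporter's actual identifiers.

package main

import (
	"context"
	"errors"
	"fmt"
	"sync"
	"time"
)

var errWALEmpty = errors.New("wal empty")

// walSketch is a stand-in for the exporter's WAL wrapper.
type walSketch struct {
	mu      sync.Mutex
	records [][]byte
	notify  chan struct{} // buffered with capacity 1
}

// persist appends a record, then nudges the tail routine without blocking.
func (w *walSketch) persist(rec []byte) {
	w.mu.Lock()
	w.records = append(w.records, rec)
	w.mu.Unlock()

	select {
	case w.notify <- struct{}{}: // wake the tailer if it is waiting
	default: // a signal is already pending; don't block the producer
	}
}

// tryRead holds the lock only for the duration of a single read attempt.
func (w *walSketch) tryRead() ([]byte, error) {
	w.mu.Lock()
	defer w.mu.Unlock()
	if len(w.records) == 0 {
		return nil, errWALEmpty
	}
	rec := w.records[0]
	w.records = w.records[1:]
	return rec, nil
}

// tail loops until the context is cancelled. While the WAL is empty it waits on
// the notify channel instead of holding the lock, so producers are never shut out.
func (w *walSketch) tail(ctx context.Context) error {
	for {
		if rec, err := w.tryRead(); err == nil {
			fmt.Printf("read %q\n", rec)
			continue
		}
		select {
		case <-w.notify:
		case <-ctx.Done():
			return ctx.Err()
		}
	}
}

func main() {
	w := &walSketch{notify: make(chan struct{}, 1)}
	ctx, cancel := context.WithTimeout(context.Background(), 100*time.Millisecond)
	defer cancel()

	go w.persist([]byte("sample"))
	_ = w.tail(ctx) // exits once the context times out
}

The key difference from the pre-fix behaviour is that the lock is scoped to a single read attempt, so a producer can always acquire it, and the non-blocking send on a buffered channel means neither side can wedge the other.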
What's the state here? :) ping @sh0rez
Looks like the checks hung, re-opening to trigger the checks
Please add a changelog entry for the bug fix
Looks like a test is failing, please have a look
--- FAIL: TestWAL_persist (0.01s)
panic: runtime error: invalid memory address or nil pointer dereference [recovered]
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0xa696b6]
goroutine 292 [running]:
testing.tRunner.func1.2({0x11fa360, 0x1abbe70})
/opt/hostedtoolcache/go/1.20.4/x64/src/testing/testing.go:1526 +0x372
testing.tRunner.func1()
/opt/hostedtoolcache/go/1.20.4/x64/src/testing/testing.go:1529 +0x650
panic({0x11fa360, 0x1abbe70})
/opt/hostedtoolcache/go/1.20.4/x64/src/runtime/panic.go:890 +0x263
go.uber.org/zap.(*Logger).check(0x0, 0xff, {0x1306013, 0x5})
/home/runner/go/pkg/mod/go.uber.org/[email protected]/logger.go:298 +0x76
go.uber.org/zap.(*Logger).Debug(0x149d400?, {0x1306013, 0x5}, {0xc0003cb2c0, 0x1, 0x1})
/home/runner/go/pkg/mod/go.uber.org/[email protected]/logger.go:211 +0x55
github.com/open-telemetry/opentelemetry-collector-contrib/exporter/prometheusremotewriteexporter.(*prweWAL).persistToWAL(0xc0000a60e0, {0xc000175e20, 0x2, 0x0?})
/home/runner/work/opentelemetry-collector-contrib/opentelemetry-collector-contrib/exporter/prometheusremotewriteexporter/wal.go:324 +0x308
github.com/open-telemetry/opentelemetry-collector-contrib/exporter/prometheusremotewriteexporter.TestWAL_persist(0xc0001f4d00)
/home/runner/work/opentelemetry-collector-contrib/opentelemetry-collector-contrib/exporter/prometheusremotewriteexporter/wal_test.go:138 +0xb11
testing.tRunner(0xc0001f4d00, 0x135cf28)
/opt/hostedtoolcache/go/1.20.4/x64/src/testing/testing.go:1576 +0x217
created by testing.(*T).Run
/opt/hostedtoolcache/go/1.20.4/x64/src/testing/testing.go:1629 +0x806
FAIL github.com/open-telemetry/opentelemetry-collector-contrib/exporter/prometheusremotewriteexporter 0.074s
FAIL
Initializes the WAL with a default no-op logger instead of the previous nil, which led to panics.
This PR was marked stale due to lack of activity. It will be closed in 14 days.
@codeboten any chance you could take another look?
rWALIndex: &atomic.Uint64{},
wWALIndex: &atomic.Uint64{},
log:       zap.NewNop(),
}, nil
This doesn't seem appropriate. Is there not a logger available to provide here?
There is a logger available in the TelemetrySettings. But adding a zap.NewNop() logger here should be equivalent to the previous state without any logger, right?
Not that I have a strong opinion about it - I just would like to resolve it at some point 😄
I think the point is more that if we are adding a logger, we need some way of providing a configuration for it.
The logger is taken from the context once run() is called, but we need a place to store it for long-running operations. Not setting this field leads to a nil pointer that panics, so it's NewNop() instead to avoid that. AFAICT this is never called before run() anyway.
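For illustration, here is a small sketch of the defaulting being discussed, assuming a *zap.Logger field on the WAL struct; the type and function names below are made up and are not the exporter's exact code.

package walexample

import (
	"context"

	"go.uber.org/zap"
)

// walState stands in for the prweWAL struct; only the logger handling is shown.
type walState struct {
	// log defaults to a no-op logger so that any call made before run() installs
	// the real logger cannot dereference a nil *zap.Logger and panic.
	log *zap.Logger
}

func newWALState() *walState {
	return &walState{log: zap.NewNop()}
}

// run installs the real logger once it is available (in the exporter it is
// taken from the context, as described above).
func (w *walState) run(ctx context.Context, logger *zap.Logger) {
	if logger != nil {
		w.log = logger
	}
	w.log.Debug("wal tail routine starting")
	<-ctx.Done() // placeholder for the actual tailing loop
}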
This PR was marked stale due to lack of activity. It will be closed in 14 days.
ping @sh0rez
Closed as inactive. Feel free to reopen if this PR is still being worked on.
@frzifus @Aneurysm9 @codeboten somehow dropped the ball on this, but it's still relevant. Rebased and addressed review comments, please take another look!
LGTM
There seem to be linting failures here. I would love to have an approval from a code owner as well: @Aneurysm9 @rapphil
Thx for picking this up again. :)
}

// wait until the tail routine is no longer busy
wal.rNotify <- struct{}{}
Just to confirm: if I remove this, the test will be blocked on the <-done until it times out. That's the purpose of this test, and what you are fixing here, right?
I'm also a bit confused as to why this is needed here. run() will fork and (indirectly) call read(), blocking until it returns, which will happen when a record is read from the WAL or the context is cancelled. read() will try to read from the WAL, find nothing, and then block reading from rNotify. persistToWAL() will then acquire the mutex, signal on rNotify, and write to the WAL. Once persistToWAL() signals on rNotify, read() starts blocking on acquiring the mutex in tryRead(), eventually acquiring it once persistToWAL() executes its deferred unlock. read() will then get a successful result from tryRead(), popping all the way back out to continuallyPopWALThenExport(), which may or may not actually export (probably not with the default truncate frequency and buffer size), and then goes right back into read(). At this point the WAL is empty and tryRead() returns the sentinel error read() uses to decide to wait on rNotify. Since persistToWAL() has returned, nothing is going to send anything on rNotify until this line here. Sending a value here causes read() to try again, but nothing happens. Then the context is cancelled and a bunch of select statements hit their <-ctx.Done() case, causing read() to return nil, context.Canceled to continuallyPopWALThenExport(), which returns context.Canceled to run(), executing its deferred export along the way, causing sink to close done and allowing this function to reach its requirement that the exported data equals the input data.

tl;dr: I don't see what writing to this channel does that isn't also accomplished by cancelling the context.
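For reference, a condensed sketch of the wait being discussed, under the assumption that read() looks roughly like the flow described above (names are illustrative): both a send on rNotify and context cancellation release a reader parked here, which is the crux of the tl;dr.

package walexample

import "context"

// readSketch attempts a read and, when the WAL is empty, blocks until either a
// producer signals rNotify or the context is cancelled. Either event unblocks
// the parked reader, so cancelling the test's context alone would also do it.
func readSketch(ctx context.Context, tryRead func() ([]byte, error), rNotify <-chan struct{}) ([]byte, error) {
	for {
		if rec, err := tryRead(); err == nil {
			return rec, nil
		}
		select {
		case <-rNotify: // new data was written; try again
		case <-ctx.Done():
			return nil, ctx.Err()
		}
	}
}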
@sh0rez, are you still interested in pursuing this?
rWALIndex: &atomic.Uint64{},
wWALIndex: &atomic.Uint64{},

// populated from context in run()
log: zap.NewNop(),
If this is populated from the context do we need to store it here? It seems like we should be doing one or the other, but not both. I originally added it to the context used by run() in this now-squashed commit, but I honestly couldn't give you a good reason why I did it that way. If it's needed in the prweWAL struct then let's accept it as a parameter to newWAL().
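A sketch of the alternative suggested here, passing the logger explicitly to the constructor; the signature and field names below are hypothetical, and the real newWAL() takes different arguments.

package walexample

import "go.uber.org/zap"

// walWithLogger stands in for the prweWAL struct; its fields are illustrative.
type walWithLogger struct {
	dir string
	log *zap.Logger
}

// newWALWithLogger accepts the logger up front, so nothing needs to be fished
// out of a context later and the struct never holds a nil logger.
func newWALWithLogger(dir string, logger *zap.Logger) *walWithLogger {
	if logger == nil {
		logger = zap.NewNop() // defensive default
	}
	return &walWithLogger{dir: dir, log: logger}
}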
I had the exact same feeling, but thought it would be OK to have it like that given it kind of follows the pattern of the existing code. I'd be OK to have this as part of a follow-up PR.
I'm torn on this. I feel strongly that we shouldn't be knowingly introducing new problems when we also know what the fix is, particularly when the fix isn't a big lift. I also realize this is a purely internal change and doesn't expose any of the wrongness outside of the package.
The other conversation about the blocking on the notify channel in the test seems more important to me to resolve. If we come to a resolution on that but this isn't fixed, I think I'd be in a position where I wouldn't approve this PR but wouldn't object to it being merged, with the understanding that this needs to be fixed as a fast follow-up.
This PR was marked stale due to lack of activity. It will be closed in 14 days.
This PR was marked stale due to lack of activity. It will be closed in 14 days.
This PR was marked stale due to lack of activity. It will be closed in 14 days.
This PR was marked stale due to lack of activity. It will be closed in 14 days.
Closed as inactive. Feel free to reopen if this PR is still being worked on.
Description:
Previously, the WAL tailing routine held an exclusive lock over the entire WAL while waiting for new sections to be written.
This led to a deadlock situation, as a locked WAL naturally cannot be written to.
To remedy this, the lock is now only held during actual read attempts, not in between.
Furthermore, inotify watching of the WAL directory has been replaced with channel-based signaling between the producers and the tailing routine, so that the latter blocks until any new writes have happened.
Link to tracking Issue:
Fixes #19363
Fixes #15277
Testing:
An end-to-end test has been added in which the WAL is first initialized without any data, so it goes into waiting mode. Writes are performed afterwards and the read-back result is verified.
This previously led to a deadlock and now no longer does.
It has also been confirmed that goroutines no longer build up and that actual HTTP requests are now being made.
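As a rough illustration of the test shape described here (not the actual TestWAL_persist in wal_test.go), the sketch below reuses the hypothetical walSketch type from the snippet near the top of this page: the tail loop is started against an empty WAL so it parks waiting for a notification, a write happens afterwards, and the record must be read back before a timeout instead of deadlocking.

package main

import (
	"context"
	"testing"
	"time"
)

func TestTailStartsEmptyThenReadsBack(t *testing.T) {
	w := &walSketch{notify: make(chan struct{}, 1)}

	got := make(chan []byte, 1)
	ctx, cancel := context.WithCancel(context.Background())
	defer cancel()

	// Start the reader first, against an empty WAL, so it has to wait.
	go func() {
		for {
			if rec, err := w.tryRead(); err == nil {
				got <- rec
				return
			}
			select {
			case <-w.notify:
			case <-ctx.Done():
				return
			}
		}
	}()

	// Write after the reader has been started.
	w.persist([]byte("sample"))

	select {
	case rec := <-got:
		if string(rec) != "sample" {
			t.Fatalf("unexpected record %q", rec)
		}
	case <-time.After(time.Second):
		t.Fatal("tail routine never observed the write: deadlock regression")
	}
}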