prometheusremotewrite: release lock when wal empty #20875

Closed
wants to merge 9 commits into from

Conversation

Member

@sh0rez sh0rez commented Apr 12, 2023

Description:

Previously, the WAL tailing routine held an exclusive lock over the entire WAL while waiting for new sections to be written.

This led to a deadlock, since a locked WAL naturally cannot be written to.

To remedy this, the lock is now only held during actual read attempts, not in between.

Furthermore, inotify watching of the WAL directory has been replaced with channel-based signaling between the producers and the tailing routine, so that the latter blocks until new writes have happened.
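
For illustration only, here is a minimal self-contained sketch of that pattern in Go; the package and identifiers (walsketch, walTail, persist, tryRead, read) are hypothetical stand-ins rather than the exporter's actual code. The writer appends under the mutex and then signals a buffered channel, while the tailing routine holds the mutex only for the duration of a single read attempt and otherwise waits on the channel or on context cancellation.

```go
package walsketch

import (
	"context"
	"errors"
	"sync"
)

// errEmpty is a sentinel returned when there is nothing new to read.
var errEmpty = errors.New("wal: no new records")

// walTail is a hypothetical stand-in for the WAL tailing state.
type walTail struct {
	mu      sync.Mutex
	records []string
	rIndex  int
	notify  chan struct{} // buffered so producers never block when signaling
}

func newWALTail() *walTail {
	return &walTail{notify: make(chan struct{}, 1)}
}

// persist appends a record under the lock, releases it, then signals the tailer.
func (w *walTail) persist(rec string) {
	w.mu.Lock()
	w.records = append(w.records, rec)
	w.mu.Unlock()

	select {
	case w.notify <- struct{}{}: // wake a waiting tailer
	default: // a wake-up is already pending; no need to queue another
	}
}

// tryRead holds the mutex only while a single read attempt is made.
func (w *walTail) tryRead() (string, error) {
	w.mu.Lock()
	defer w.mu.Unlock()
	if w.rIndex >= len(w.records) {
		return "", errEmpty
	}
	rec := w.records[w.rIndex]
	w.rIndex++
	return rec, nil
}

// read loops: if the WAL is empty, it waits without holding the lock,
// so writers are never blocked while the tailer idles.
func (w *walTail) read(ctx context.Context) (string, error) {
	for {
		rec, err := w.tryRead()
		if err == nil {
			return rec, nil
		}
		select {
		case <-w.notify: // a producer wrote something; try again
		case <-ctx.Done():
			return "", ctx.Err()
		}
	}
}
```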

Link to tracking Issue:

Fixes #19363
Fixes #15277

Testing:

An "end-to-end" test has been added in which the WAL is first initialized without any data, so it goes into waiting mode. Writes are performed afterwards and the read-back result is verified.

This previously led to a deadlock and now no longer does.

It has also been confirmed that goroutines no longer build up and that actual HTTP requests are now being made.
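
Roughly, the shape of such a test could look as follows; this is a hypothetical sketch reusing the walTail type from the sketch above, not the actual test code. The tailer is started against an empty WAL, the write happens afterwards, and the test fails on a timeout instead of hanging forever.

```go
package walsketch

import (
	"context"
	"testing"
	"time"
)

func TestWALTail_emptyThenWrite(t *testing.T) {
	w := newWALTail()

	ctx, cancel := context.WithCancel(context.Background())
	defer cancel()

	got := make(chan string, 1)
	go func() {
		// Blocks until a record shows up (or the context is cancelled).
		if rec, err := w.read(ctx); err == nil {
			got <- rec
		}
	}()

	// The write may land before or after the tailer starts waiting;
	// either way the read-back must succeed without deadlocking.
	w.persist("hello")

	select {
	case rec := <-got:
		if rec != "hello" {
			t.Fatalf("read back %q, want %q", rec, "hello")
		}
	case <-time.After(5 * time.Second):
		t.Fatal("timed out waiting for read-back; tailer likely deadlocked")
	}
}
```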

@sh0rez sh0rez requested a review from a team April 12, 2023 15:00
@sh0rez sh0rez requested a review from Aneurysm9 as a code owner April 12, 2023 15:00

linux-foundation-easycla bot commented Apr 12, 2023

CLA Signed

The committers listed above are authorized under a signed CLA.

Member

frzifus commented May 24, 2023

What's the state here? :) ping @sh0rez

@codeboten
Contributor

Looks like the checks hung; re-opening to trigger the checks.

@codeboten codeboten closed this May 31, 2023
@codeboten codeboten reopened this May 31, 2023
Contributor

@codeboten codeboten left a comment

Please add a changelog entry for the bug fix

Contributor

@codeboten codeboten left a comment

Looks like a test is failing; please have a look:

--- FAIL: TestWAL_persist (0.01s)
panic: runtime error: invalid memory address or nil pointer dereference [recovered]
	panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0xa696b6]

goroutine 292 [running]:
testing.tRunner.func1.2({0x11fa360, 0x1abbe70})
	/opt/hostedtoolcache/go/1.20.4/x64/src/testing/testing.go:1526 +0x372
testing.tRunner.func1()
	/opt/hostedtoolcache/go/1.20.4/x64/src/testing/testing.go:1529 +0x650
panic({0x11fa360, 0x1abbe70})
	/opt/hostedtoolcache/go/1.20.4/x64/src/runtime/panic.go:890 +0x263
go.uber.org/zap.(*Logger).check(0x0, 0xff, {0x1306013, 0x5})
	/home/runner/go/pkg/mod/go.uber.org/[email protected]/logger.go:298 +0x76
go.uber.org/zap.(*Logger).Debug(0x149d400?, {0x1306013, 0x5}, {0xc0003cb2c0, 0x1, 0x1})
	/home/runner/go/pkg/mod/go.uber.org/[email protected]/logger.go:211 +0x55
github.com/open-telemetry/opentelemetry-collector-contrib/exporter/prometheusremotewriteexporter.(*prweWAL).persistToWAL(0xc0000a60e0, {0xc000175e20, 0x2, 0x0?})
	/home/runner/work/opentelemetry-collector-contrib/opentelemetry-collector-contrib/exporter/prometheusremotewriteexporter/wal.go:324 +0x308
github.com/open-telemetry/opentelemetry-collector-contrib/exporter/prometheusremotewriteexporter.TestWAL_persist(0xc0001f4d00)
	/home/runner/work/opentelemetry-collector-contrib/opentelemetry-collector-contrib/exporter/prometheusremotewriteexporter/wal_test.go:138 +0xb11
testing.tRunner(0xc0001f4d00, 0x135cf28)
	/opt/hostedtoolcache/go/1.20.4/x64/src/testing/testing.go:1576 +0x217
created by testing.(*T).Run
	/opt/hostedtoolcache/go/1.20.4/x64/src/testing/testing.go:1629 +0x806
FAIL	github.com/open-telemetry/opentelemetry-collector-contrib/exporter/prometheusremotewriteexporter	0.074s
FAIL

Initializes the WAL with a default nop logger instead of the previous nil, which led to panics.
@github-actions github-actions bot requested a review from rapphil June 5, 2023 12:40
@github-actions
Contributor

This PR was marked stale due to lack of activity. It will be closed in 14 days.

@github-actions github-actions bot added the Stale label Jun 20, 2023
@sh0rez sh0rez requested a review from codeboten June 23, 2023 08:39
@github-actions github-actions bot removed the Stale label Jun 24, 2023
Member Author

sh0rez commented Jul 4, 2023

@codeboten any chance you could take another look?

exporter/prometheusremotewriteexporter/wal.go (outdated)
rWALIndex: &atomic.Uint64{},
wWALIndex: &atomic.Uint64{},
log:       zap.NewNop(),
}, nil
Member

This doesn't seem appropriate. Is there not a logger available to provide here?

Member

There is a logger available in the TelemetrySettings. But adding a zap.NewNop() logger here should be equivalent to the previous state without any logger, right?

Not that I have a strong opinion about it - I just would like to resolve it at some point 😄

Contributor

I think the point is more that, if we are adding a logger, we need some way of providing a configuration for it.

Member Author

The logger is taken from the context once run() is called, but we need a place to store it for long-running operations.

Not setting this var leads to a nil pointer that panics, so it's NewNop() instead to avoid that. AFAICT this is never called before run() anyway.
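
A minimal sketch of that pattern, using hypothetical names (tailer, loggerKey) rather than the exporter's actual prweWAL code: the field starts out as a no-op logger so nothing can dereference nil, and run() swaps in the real logger taken from the context.

```go
package walsketch

import (
	"context"

	"go.uber.org/zap"
)

type loggerKey struct{}

// tailer sketches the logger handling discussed here.
type tailer struct {
	log *zap.Logger
}

func newTailer() *tailer {
	// zap.NewNop() behaves like having no logger at all, but is safe to
	// call, so methods invoked before run() cannot hit a nil pointer.
	return &tailer{log: zap.NewNop()}
}

func (t *tailer) run(ctx context.Context) {
	// Hypothetical: replace the no-op logger with the real one carried in
	// the context, mirroring "populated from context in run()".
	if l, ok := ctx.Value(loggerKey{}).(*zap.Logger); ok && l != nil {
		t.log = l
	}
	t.log.Debug("tailer started")
}
```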

@github-actions
Contributor

This PR was marked stale due to lack of activity. It will be closed in 14 days.

@github-actions github-actions bot added the Stale label Jul 20, 2023
@frzifus frzifus removed the Stale label Jul 20, 2023
Member

frzifus commented Jul 20, 2023

ping @sh0rez

Contributor

github-actions bot commented Sep 9, 2023

Closed as inactive. Feel free to reopen if this PR is still being worked on.

@jpkrohling jpkrohling reopened this May 2, 2024
Member Author

sh0rez commented May 2, 2024

@frzifus @Aneurysm9 @codeboten somehow dropped the ball on this, but it's still relevant.

rebased and addressed review comments, please take another look!

@github-actions github-actions bot removed the Stale label May 3, 2024
Member

@jpkrohling jpkrohling left a comment

LGTM

@jpkrohling
Member

There seem to be linting failures here. I would love to have an approval from a code owner as well: @Aneurysm9 @rapphil

@jpkrohling jpkrohling requested a review from Aneurysm9 May 3, 2024 07:53
Member

@frzifus frzifus left a comment

Thx for picking this up again. :)

}

// wait until the tail routine is no longer busy
wal.rNotify <- struct{}{}
Member

Just to confirm: if I remove this, the test will be blocked on the <-done until it times out. That's the purpose of this test, and what you are fixing here, right?

Member

I'm also a bit confused as to why this is needed here. run() will fork and (indirectly) call read(), blocking until it returns, which will happen when a record is read from the WAL or the context is cancelled. read() will try to read from the WAL, find nothing, and then block reading from rNotify.

persistToWAL() will then acquire the mutex, signal on rNotify, and write to the WAL. Once persistToWAL() signals on rNotify, read() starts blocking on acquiring the mutex in tryRead(), eventually acquiring it once persistToWAL() executes its deferred unlock. read() will then get a successful result from tryRead(), popping all the way back out to continuallyPopWALThenExport(), which may or may not actually export (probably not with the default truncate frequency and buffer size), and then goes right back into read().

At this point the WAL is empty and tryRead() returns the sentinel error read() uses to decide to wait on rNotify. Since persistToWAL() has returned, nothing is going to send anything on rNotify until this line here. Sending a value here causes read() to try again, but nothing happens.

Then the context is cancelled and a bunch of select statements hit their <-ctx.Done() case, causing read() to return nil, context.Canceled to continuallyPopWALThenExport(), which returns context.Canceled to run(), executing its deferred export along the way, causing sink to close done and allowing this function to reach its requirement that the exported data equals the input data.

tl;dr: I don't see what writing to this channel does that isn't also accomplished by cancelling the context.

Member

@sh0rez, are you still interested in pursuing this?

rWALIndex: &atomic.Uint64{},
wWALIndex: &atomic.Uint64{},

// populated from context in run()
log: zap.NewNop(),
Member

If this is populated from the context, do we need to store it here? It seems like we should be doing one or the other, but not both. I originally added it to the context used by run() in this now-squashed commit, but I honestly couldn't give you a good reason why I did it that way. If it's needed in the prweWAL struct, then let's accept it as a parameter to newWAL().

Member

I had the exact same feeling, but thought it would be ok to have it like that given it kind of follows the pattern of the existing code. I'd be ok to have this as part of a follow-up PR.

Member

I'm torn on this. I feel strongly that we shouldn't be knowingly introducing new problems when we also know what the fix is, particularly when the fix isn't a big lift. I also realize this is a purely internal change and doesn't expose any of the wrongness outside of the package.

The other conversation about the blocking on the notify channel in the test seems more important to me to resolve. If we come to a resolution on that but this isn't fixed, I think I'd be in a position where I wouldn't approve this PR but wouldn't object to it being merged, with the understanding that this needs to be fixed as a fast follower.
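
For comparison, the alternative raised earlier in this thread, accepting the logger as a constructor parameter instead of pulling it from the context, could look roughly like this; the signature is hypothetical and reuses the tailer sketch from above.

```go
package walsketch

import "go.uber.org/zap"

// Hypothetical alternative: take the logger explicitly in the constructor,
// reusing the tailer type from the sketch above.
func newTailerWithLogger(logger *zap.Logger) *tailer {
	if logger == nil {
		logger = zap.NewNop() // still guard against nil to avoid panics
	}
	return &tailer{log: logger}
}
```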

Contributor

This PR was marked stale due to lack of activity. It will be closed in 14 days.

@github-actions github-actions bot added the Stale label May 23, 2024
@frzifus frzifus removed the Stale label May 23, 2024
Contributor

github-actions bot commented Jun 7, 2024

This PR was marked stale due to lack of activity. It will be closed in 14 days.

@github-actions github-actions bot added Stale and removed Stale labels Jun 7, 2024
Contributor

This PR was marked stale due to lack of activity. It will be closed in 14 days.

@github-actions github-actions bot added the Stale label Jun 22, 2024
@frzifus frzifus removed the Stale label Jun 22, 2024
Contributor

github-actions bot commented Jul 7, 2024

This PR was marked stale due to lack of activity. It will be closed in 14 days.

@github-actions github-actions bot added the Stale label Jul 7, 2024
Contributor

Closed as inactive. Feel free to reopen if this PR is still being worked on.
