Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

x/telemetry: crash on Windows when counter file is extended #68358

Closed
findleyr opened this issue Jul 9, 2024 · 7 comments
Closed

x/telemetry: crash on Windows when counter file is extended #68358

findleyr opened this issue Jul 9, 2024 · 7 comments
Assignees
Labels
NeedsFix The path to resolution is known, but the work has not been done. OS-Windows release-blocker telemetry x/telemetry issues
Milestone

Comments

@findleyr
Copy link
Member

findleyr commented Jul 9, 2024

Discovered while investigating #68311: telemetry counter file extension can cause windows processes to panic. For example: this crash is from a new end-to-end test added in https://go.dev/cl/597278, without any changes to the x/telemetry logic.

https://ci.chromium.org/ui/p/golang/builders/try/x_telemetry-go1.21-windows-amd64/b8742848988042691361/test-results?q=ExactID%3Agolang.org%2Fx%2Ftelemetry%2Finternal%2Fcounter.TestConcurrentExtension+VHash%3Afb236f0bcc2ba24d&clean=&sortby=&groupby=

I don't think this is a theoretical or pathological setup. With the high concurrency of toolchain processes, it seems likely that Windows users will encounter crashes due to this bug. Furthermore, these crashes will only become more likely as we add more instrumentation. Therefore, marking as a release blocker.

CC @golang/telemetry

@findleyr findleyr added NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. release-blocker labels Jul 9, 2024
@findleyr findleyr added this to the Go1.23 milestone Jul 9, 2024
@findleyr findleyr self-assigned this Jul 9, 2024
@gopherbot gopherbot added the telemetry x/telemetry issues label Jul 9, 2024
@gabyhelp
Copy link

gabyhelp commented Jul 9, 2024

Related Issues and Documentation

(Emoji vote if this was helpful or unhelpful; more detailed feedback welcome in this discussion.)

@findleyr
Copy link
Member Author

findleyr commented Jul 9, 2024

Investigating briefly, it seems likely that this is due to the unmapping of the existing file before it is remapped:
https://go.googlesource.com/telemetry/+/refs/tags/config/v0.25.0/internal/mmap/mmap_windows.go#18

@hyangah
Copy link
Contributor

hyangah commented Jul 10, 2024

Thanks for the investigation. What happens if you delete the munmapFile call?
Is it there because Windows is unhappy about creating another map on the same file handle?

@findleyr
Copy link
Member Author

In https://go.dev/cl/597279, I tested exactly that (see TestMultipleMaps). Windows seems happy to provide multiple views of the same file.

My hypothesis is that the unmap is done only for tests, which fail to clean up if there are any outstanding mappings of the counter file. See also the SetFinalizer call to mappedFile.close - the need for a setfinalizer is that counter.file.current may be read without holding counter.file.mu, and therefore the lifecycle of mappedFiles is unmanaged.

I believe the correct solution is to turn counter.file.current into a refcounted value, managed by the counter.file, and ensure that mappings are correctly cleaned up in tests.

@findleyr findleyr added NeedsFix The path to resolution is known, but the work has not been done. and removed NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. labels Jul 10, 2024
@findleyr
Copy link
Member Author

After a bit more testing (ensuring that other mappings remain valid when one is closed), I believe the solution described in #68358 (comment) is the path forward. Therefore marking this as NeedsFix.

I will aim for a fix today. The fix will not be completely trivial, but there are only 7 non-test calls to file.current.load that need to be updated, and the logic should be simple. The additional complexity of awaiting the unmapping will only affect tests.

@gopherbot
Copy link
Contributor

Change https://go.dev/cl/597278 mentions this issue: internal/counter: fix windows mapped file extension

@findleyr
Copy link
Member Author

Well, after much iteration I realized that the existing code was very close to unmapping files correctly: we just need to close the mapping after invalidating counters, which must be done after unlocking the counter.file. This was done after rotation, but not after extension (but coincidentally I'd left a TODO a while ago to investigate why it wasn't done after extension).

So, we're able to (1) remove the unsafe unmapping, (2) remove the runtime.SetFinalizer logic, and (3) resolve that TODO, without introducing additional mechanisms. A satisfactory change.

I believe this should be a relatively safe CL to land.

netbsd-srcmastr pushed a commit to NetBSD/pkgsrc that referenced this issue Sep 13, 2024
Changes in this release

This release updates the golang.org/x/telemetry dependency to pick
up fixes for the following bugs:

    golang/go#68946: unnecessary downloading of the golang.org/x/telemetry/config module
    golang/go#68358: a potential crash on Windows when the telemetry counter file is extended
    golang/go#68311: a potential hanging process if the telemetry file is truncated

Additionally, this release changes the gopls/telemetryprompt/accepted
counter to be recorded each time the prompt state is checked
(golang/go#68770).

None of these issues were particularly urgent, but we also wanted
to make a gopls release to exercise our recently added release
automation (golang/go#57643).
Pinkle-pash added a commit to Pinkle-pash/telemetry that referenced this issue Dec 4, 2024
On Windows, unmapping the previous mappedFile in mmap.Mmap caused
panics when counter pointers were read concurrent to the remapping.

Upon investigation, it appears that the unmapping was only necessary for
tests, to ensure that previous mappings were cleaned up (and therefore
that test files can be deleted). A call to runtime.SetFinalizer(...,
mappedFile.close) appeared to serve a similar purpose, yet for an
unknown reason finalizers were never run.

Deeper investigation revealed that there was simply one bug in file
cleanup (coincidentally already noted in a TODO): after storing the
newly mapped file in file.newCounter1 and invalidating counters, we can
close the previous mappedFile.

Therefore:
- fix the cleanup in file.newCounter1
- remove the unmap in mmap.Mmap on windows
- remove the now unnecessary 'existing' parameter in mmap APIs
- remove the SetFinalizer call
- add a test for multiple concurrent mappings of a file
- add an end-to-end test for concurrent file extension
- change ReadCounter to read by memory mapping the file, in an attempt
  to avoid msync issues

For golang/go#68311
Fixes golang/go#68358

Change-Id: I27b6f4f4939e93f7c76f920d553848bf014be236
Reviewed-on: https://go-review.googlesource.com/c/telemetry/+/597278
LUCI-TryBot-Result: Go LUCI <[email protected]>
Reviewed-by: Hyang-Ah Hana Kim <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
NeedsFix The path to resolution is known, but the work has not been done. OS-Windows release-blocker telemetry x/telemetry issues
Projects
Status: Done
Development

No branches or pull requests

5 participants