AlpineARM32 Only, running dotnet new command will be Stuck with long time high CPU consumption after installing net9.0.100-preview.3.24175.24 #100536

Closed
ChenhuiYuan01 opened this issue Mar 28, 2024 · 23 comments · Fixed by #101147
Labels
area-ExceptionHandling-coreclr in-pr There is an active PR which will close this issue when it is merged

Comments

@ChenhuiYuan01
Member

ChenhuiYuan01 commented Mar 28, 2024

Reproduction Steps

  1. On Alpine Linux ARM32 in Docker, install 9.0.100-preview.3.24175.24 from https://dev.azure.com/dnceng/internal/_build/results?buildId=2414737&view=artifacts&pathAsName=false&type=publishedArtifacts
  2. In CLI, run dotnet new console

Expected:
The project is created successfully.

Actual Behavior
The command is blocked: running dotnet new console hangs for a long time with high CPU consumption.
[Screenshots: "pending", "cpu"]

dotnet --info
.NET SDK:
Version: 9.0.100-preview.3.24175.24
Commit: 09d6f381e6
Workload version: 9.0.100-manifests.77bb7ba9
MSBuild version: 17.10.0-preview-24175-03+89b42a486

Runtime Environment:
OS Name: alpine
OS Version: 3.20
OS Platform: Linux
RID: linux-musl-arm
Base Path: /root/mytest/sdk/9.0.100-preview.3.24175.24/

.NET workloads installed:
There are no installed workloads to display.

Host:
Version: 9.0.0-preview.3.24172.9
Architecture: arm
Commit: 9e6ba1f

.NET SDKs installed:
9.0.100-preview.3.24175.24 [/root/mytest/sdk]

.NET runtimes installed:
Microsoft.AspNetCore.App 9.0.0-preview.3.24172.13 [/root/mytest/shared/Microsoft.AspNetCore.App]
Microsoft.NETCore.App 9.0.0-preview.3.24172.9 [/root/mytest/shared/Microsoft.NETCore.App]

Other architectures found:
None

Environment variables:
Not set

global.json file:
Not found

Learn more:
https://aka.ms/dotnet/info

@dotnet-issue-labeler dotnet-issue-labeler bot added area-Workloads Workloads like wasm-tools untriaged New issue has not been triaged by the area owner labels Mar 28, 2024
@ChenhuiYuan01 ChenhuiYuan01 changed the title AlpineARM32 Only, Running dotnet command after installing net9.0.100-preview.3.24175.24 will block. AlpineARM32 Only, running dotnet new command will Stuck with long time high CPU consumption after installing net9.0.100-preview.3.24175.24 Mar 28, 2024
@ChenhuiYuan01 ChenhuiYuan01 changed the title AlpineARM32 Only, running dotnet new command will Stuck with long time high CPU consumption after installing net9.0.100-preview.3.24175.24 AlpineARM32 Only, running dotnet new command will be Stuck with long time high CPU consumption after installing net9.0.100-preview.3.24175.24 Mar 28, 2024
@marcpopMSFT marcpopMSFT removed the area-Workloads Workloads like wasm-tools label Mar 28, 2024
@marcpopMSFT
Member

@ChenhuiYuan01 is it only dotnet new that gets stuck or other commands as well? CC @MiYanni @joeloff if this is a templating issue.

@ChenhuiYuan01
Member Author

@marcpopMSFT
Here are the commands we tried
1. dotnet --info, dotnet --version, dotnet --list-sdks --> ran successfully
2. dotnet new, dotnet restore, dotnet build, dotnet run, dotnet publish --> stuck

@ChenhuiYuan01
Member Author

This issue also reproduces on 9.0.100-preview.4.24178.10 from https://github.com/dotnet/installer?tab=readme-ov-file.
[Screenshots: "Screenshot 2024-03-29 155235", "info"]

@marcpopMSFT
Member

@marcpopMSFT Here are the commands we tried: 1. dotnet --info, dotnet --version, dotnet --list-sdks --> ran successfully 2. dotnet new, dotnet restore, dotnet build, dotnet run, dotnet publish --> stuck

Seems like we're blocked on Alpine Arm, but I don't know how critical that platform is.

@nkolev92 @dsplaisted from the above list, it kind of looks like any command that hits nuget is hanging. dotnet new doesn't build but does do a restore after creating the project. Thoughts on this one?

@marcpopMSFT
Member

I assume this is alpine arm64 only. If so, based on telemetry data I would not consider this a blocker. I checked the data and the arm64 alpine usage is <1% of all alpine usage and alpine usage is <1% of linux usage (from the data we have access to).

@dsplaisted
Member

It could be something with restore, or with MSBuild. I would guess that it's something to do with the network access that restore does. We could do dotnet new and build with the options to skip restore to help validate where the hang is occurring. But really I think someone needs to attach a debugger or inspect a dump or something to figure out what's going on.
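
One way to run the check suggested above is sketched here. It assumes a `timeout` utility is available in the container (BusyBox and GNU coreutils both provide one) and that 300 seconds is long enough to tell a hang from a slow run under emulation. `returns_in_time` is a hypothetical helper name for this sketch, not part of the .NET CLI.

```shell
#!/bin/sh
# returns_in_time SECS CMD...: succeed if CMD exits on its own (with any
# status) before SECS elapse; fail if timeout had to kill it.
returns_in_time() {
    limit=$1
    shift
    timeout "$limit" "$@" >/dev/null 2>&1
    status=$?
    # GNU timeout reports 124 when the limit is hit; 143 (128 + SIGTERM) is
    # also treated as a hang in case the local timeout reports the signal.
    [ "$status" -ne 124 ] && [ "$status" -ne 143 ]
}

# Only runs where the SDK is installed; works in a scratch directory.
if command -v dotnet >/dev/null 2>&1; then
    (
        cd "$(mktemp -d)" || exit 1
        returns_in_time 300 dotnet new console --no-restore \
            || echo "hang: dotnet new console --no-restore"
        # Without a prior restore this build is expected to fail quickly;
        # the question here is only whether it returns at all.
        returns_in_time 300 dotnet build --no-restore \
            || echo "hang: dotnet build --no-restore"
        returns_in_time 300 dotnet restore \
            || echo "hang: dotnet restore"
    )
fi
```

If only the `dotnet restore` line is reported, that points at restore. If the steps that skip restore are reported too, the hang is probably somewhere else.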

@lbussell
Contributor

lbussell commented Mar 29, 2024

@marcpopMSFT @dsplaisted This is blocking Alpine Arm32 official .NET container images. Observing this with the same 9.0.100-preview.3.24175.24 version. We're not seeing this in Arm64 Docker images though. I put together this repro Dockerfile because I intended to file an issue, but found this existing issue instead:

FROM arm32v7/alpine:3.19

ENV \
    # Configure web servers to bind to port 8080 when present
    ASPNETCORE_HTTP_PORTS=8080 \
    # Enable detection of running in a container
    DOTNET_RUNNING_IN_CONTAINER=true \
    # Do not generate certificate
    DOTNET_GENERATE_ASPNET_CERTIFICATE=false \
    # Do not show first run text
    DOTNET_NOLOGO=true \
    # SDK version
    DOTNET_SDK_VERSION=9.0.100-preview.3.24175.24 \
    # Set the invariant mode since ICU package isn't included
    DOTNET_SYSTEM_GLOBALIZATION_INVARIANT=true \
    # Enable correct mode for dotnet watch (only mode supported in a container)
    DOTNET_USE_POLLING_FILE_WATCHER=true \
    # Skip extraction of XML docs - generally not useful within an image/container - helps performance
    NUGET_XMLDOC_MODE=skip \
    # PowerShell telemetry for docker image usage
    POWERSHELL_DISTRIBUTION_CHANNEL=PSDocker-DotnetSDK-Alpine-3.19-arm32

RUN apk add --upgrade --no-cache \
        ca-certificates-bundle \
        \
        # .NET dependencies
        libgcc \
        libssl3 \
        libstdc++ \
        zlib \
        curl

# Install .NET SDK
RUN wget -O dotnet.tar.gz https://dotnetbuilds.azureedge.net/public/Sdk/$DOTNET_SDK_VERSION/dotnet-sdk-$DOTNET_SDK_VERSION-linux-musl-arm.tar.gz \
    && dotnet_sha512='c0a702b295f275b135d7fc845322b71ff9298fc35771fa0ae4118e5766d3d2c16a3658757a7b8cc41ec095a8c532b58322c91c1239ba851d48ba30480932fb95' \
    && echo "$dotnet_sha512  dotnet.tar.gz" | sha512sum -c - \
    && mkdir -p /usr/share/dotnet \
    && tar -oxzf dotnet.tar.gz -C /usr/share/dotnet \
    && rm dotnet.tar.gz \
    && ln -s /usr/share/dotnet/dotnet /usr/bin/dotnet

# Run any arbitrary .NET command
RUN dotnet --version \
    && dotnet help

It can be built with the command docker build --progress=plain --platform linux/arm/v7 -t repro -f ./Dockerfile .

I used an arm64 machine in DTL to repro it. It also hangs under Docker qemu emulation on an amd64 machine.

Related: dotnet/dotnet-docker#5309

@NicoleWang001
Member

@marcpopMSFT is this blocking .NET 9 Preview 3 release?

@marcpopMSFT
Member

@lbussell if you hit this, did you happen to be able to collect a dump or attach a debugger to get a callstack? Could you try a dotnet build --no-restore and see if it doesn't hang then? I'm inclined to believe it's a restore hang but that would prove it so we can route to @dotnet/nuget-team .

And @NicoleWang001 no idea what the priority of this specific config is. That's a question for tactics.

@lbussell
Contributor

lbussell commented Apr 1, 2024

@marcpopMSFT Sorry, my repro was not clear enough. The dotnet help line is what hangs. I can try to get a dump from the dotnet executable for that, if you think that will help.

@marcpopMSFT
Member

You're saying it hangs when running dotnet --help? That line also has a dotnet new console on it so I assumed it was that. Please do if you can.

@lbussell
Contributor

lbussell commented Apr 1, 2024

You're saying it hangs when running dotnet --help? That line also has a dotnet new console on it so I assumed it was that. Please do if you can.

@marcpopMSFT yes it hangs when running dotnet --help. Interestingly it hangs after printing the help command's text.

I cannot get a dump without some further help with this - dotnet tool install --global dotnet-dump hangs, obviously, and the dotnet dump tool does not provide a linux-musl-arm binary. Arm64 does not hang.

@marcpopMSFT
Member

@lbussell can you install a different version of the SDK that's working in a different location in order to install the dump tool?

@marcpopMSFT
Member

CC @elinor-fung @agocke in case the issue is in the host itself as --help shouldn't do much.

@lbussell
Contributor

lbussell commented Apr 1, 2024

@lbussell can you install a different version of the SDK that's working in a different location in order to install the dump tool?

@marcpopMSFT

Here's what I'm trying -

  1. Use latest known working SDK to install the dotnet-dump tool. Seems the last known good SDK was 9.0.0-preview.3.24162.31
  2. Copy the working SDK and the dotnet-dump tool to the repro image.
  3. Make sure the working version of .NET is on the PATH for the dotnet-dump tool to find.
  4. Build repro image - docker build --progress=plain --platform linux/arm/v7 -t repro -f ./Dockerfile .
  5. Run a dotnet command in the container to let it hang - docker run --rm -d --name repro-container repro /bin/sh -c '/usr/share/dotnet-broken/dotnet --help'
  6. Confirm the dotnet --help process PID (it's usually 1) - docker exec repro-container /bin/sh -c 'export DOTNET_ROOT=/usr/share/dotnet-working/ && /root/.dotnet/tools/dotnet-dump ps'
  7. Run dotnet-dump tool to collect a dump - docker exec repro-container /bin/sh -c 'export DOTNET_ROOT=/usr/share/dotnet-working/ && /root/.dotnet/tools/dotnet-dump collect --diag --type Mini -p 1'

The dotnet-dump tool hangs forever in this configuration, even trying to collect a "Mini" dump. /cc @mikem8361

Here's the updated Dockerfile I'm using to do this -

FROM mcr.microsoft.com/dotnet/nightly/sdk:9.0-preview-alpine3.19-arm32v7 as installer

RUN dotnet tool install --global dotnet-dump

FROM arm32v7/alpine:3.19

ENV ... same as above

RUN apk add ... same as above

COPY --from=installer [ "/usr/share/dotnet/", "/usr/share/dotnet-working/" ]
COPY --from=installer [ "/root/.dotnet/tools", "/root/.dotnet/tools" ]

# Install .NET SDK
RUN wget -O dotnet.tar.gz https://dotnetbuilds.azureedge.net/public/Sdk/$DOTNET_SDK_VERSION/dotnet-sdk-$DOTNET_SDK_VERSION-linux-musl-arm.tar.gz \
    && mkdir -p /usr/share/dotnet-broken \
    && tar -oxzf dotnet.tar.gz -C /usr/share/dotnet-broken \
    && rm dotnet.tar.gz \
    && ln -s /usr/share/dotnet-working/dotnet /usr/bin/dotnet

@marcpopMSFT
Member

Per @agocke in tactics, moving to runtime repo.

@dotnet-issue-labeler dotnet-issue-labeler bot added the needs-area-label An area label is needed to ensure this gets routed to the appropriate area owners label Apr 2, 2024
@marcpopMSFT marcpopMSFT transferred this issue from dotnet/sdk Apr 2, 2024
@jeffschwMSFT
Member

@mangod9 can you take a look? not sure where this high cpu may be coming from. also including @elinor-fung

@mangod9
Member

mangod9 commented Apr 3, 2024

@ChenhuiYuan01, assuming this is a consistent repro are you able to capture a dump with dotnet-dump when the high cpu is occurring? Thanks

@vcsjones vcsjones removed the needs-area-label An area label is needed to ensure this gets routed to the appropriate area owners label Apr 3, 2024
@NicoleWang001
Member

You're saying it hangs when running dotnet --help? That line also has a dotnet new console on it so I assumed it was that. Please do if you can.

@marcpopMSFT yes it hangs when running dotnet --help. Interestingly it hangs after printing the help command's text.

I cannot get a dump without some further help with this - dotnet tool install --global dotnet-dump hangs, obviously, and the dotnet dump tool does not provide a linux-musl-arm binary. Arm64 does not hang.

@mangod9 We could not get a dump either, as dotnet-dump hangs.
We also tried the following, but still could not get a dump:

  1. In an Alpine Arm32 container: install .NET 8.0 -> install the dotnet-dump tool -> install SDK 9.0.100-preview.3.24175.24 and run dotnet --help -> dotnet-dump hangs as well.
  2. In an Alpine Arm32 container: install SDK 9.0.100-preview.3.24175.24 and run dotnet --help. On the host: install SDK 9.0.100-preview.3.24175.24 -> install the dotnet-dump tool -> dotnet-dump fails with:
    Unhandled exception: System.Reflection.TargetInvocationException: Exception has been thrown by the target of an invocation.
    ---> System.Net.Sockets.SocketException (111): Connection refused

@lauxjpn

lauxjpn commented Apr 8, 2024

@NicoleWang001 I would suggest generating a core dump without any .NET tooling (using the native tools bundled with the OS). Then run dotnet-dump in a similar environment that is not the failing Docker container (e.g. a real arm32v7 machine) to analyze the previously collected core dump.
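
A minimal sketch of that approach, assuming gdb's `gcore` is available inside the container and that ptrace is permitted there. Under Docker that may require extra capabilities such as `--cap-add=SYS_PTRACE`, and it may not work at all under QEMU user-mode emulation. `dump_native` is a hypothetical helper name.

```shell
#!/bin/sh
# dump_native PID [PREFIX]: write a core file for PID with gdb's gcore, so no
# .NET tooling has to run inside the failing container.
# gcore names its output PREFIX.PID; that path is printed on success.
dump_native() {
    pid=$1
    prefix=${2:-/tmp/core}
    if ! command -v gcore >/dev/null 2>&1; then
        echo "gcore not found; it normally ships with gdb" >&2
        return 127
    fi
    # Send gcore's own chatter to stderr so stdout carries only the path.
    gcore -o "$prefix" "$pid" >&2 || return
    echo "$prefix.$pid"
}

# Example (PID 1 is the hanging dotnet process in the repro container):
#   dump_native 1 /tmp/dotnet-hang
```

The resulting file could then be copied out with `docker cp` and opened elsewhere with `dotnet-dump analyze`, as suggested above.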

@lauxjpn

lauxjpn commented Apr 8, 2024

I am experiencing hangs (stuck forever, manually aborted by me after 8 hours) for all dotnet commands when targeting arm32v7 once the SDK is installed (not if only the runtime is installed though), for .NET 6, 7, 8 and 9:

FROM mcr.microsoft.com/dotnet/nightly/sdk:9.0.100-preview.3-alpine3.19-arm32v7
WORKDIR /srv
RUN dotnet help

FROM mcr.microsoft.com/dotnet/sdk:8.0-alpine3.18-arm32v7
WORKDIR /srv
RUN dotnet help

FROM mcr.microsoft.com/dotnet/sdk:7.0-alpine3.18-arm32v7
WORKDIR /srv
RUN dotnet help

FROM mcr.microsoft.com/dotnet/sdk:6.0-alpine3.18-arm32v7
WORKDIR /srv
RUN dotnet help

docker build -f 'arm32v7-test.Dockerfile' --platform 'linux/arm32v7' --pull -t 'arm32v7-test:latest' ./empty/

Tested with a docker host that is Windows 10 x64 (with WSL2 integration to Ubuntu) or an Ubuntu 22.04/23.10 VM (VirtualBox, with a Windows 10 x64 host). So these are environments that use qemu to emulate arm32v7.

BTW, downloading and running the SDK in a custom build (buildroot) qemu busybox image targeting arm32v7 (so without docker) works fine. I have not yet compared my custom buildroot options against those of https://github.com/docker-library/busybox.


Here are results from COREHOST_TRACE=1:

  • COREHOST_TRACE=1 output for dotnet help in docker before it hangs: dotnet_trace_01.txt
  • COREHOST_TRACE=1 output for dotnet help in custom working busybox qemu build (without docker): dotnet_trace_02.txt

Differences are the HOST_RUNTIME_CONTRACT and the following lines that are output in the working build in addition to the output of the non-working docker build (the docker build hangs before those lines):

Launch host: /usr/share/dotnet/dotnet, app: /usr/share/dotnet/sdk/8.0.203/dotnet.dll, argc: 1, args: help,
--- Begin breadcrumb write
Directory core breadcrumbs [] was not specified or found
Fallback directory core breadcrumbs at [opt/corebreadcrumbs] was not found
Breadcrumb store was not obtained... skipping write.
Execute managed assembly exit code: 0x0

/cc @richlander
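
For anyone repeating this comparison, a rough sketch follows. It assumes both traces were saved to files, for example by setting the host's `COREHOST_TRACEFILE` variable alongside `COREHOST_TRACE=1`. `only_in_working` is a hypothetical helper name. Lines that differ only in paths or version numbers will also show up, so the output still needs reading by eye.

```shell
#!/bin/sh
# only_in_working BROKEN WORKING: print the lines of the WORKING trace that
# do not appear anywhere in the BROKEN trace (exact whole-line comparison).
only_in_working() {
    grep -Fxv -f "$1" "$2"
}

# Example capture and comparison, left commented out because it needs the SDK:
#   COREHOST_TRACE=1 COREHOST_TRACEFILE=/tmp/dotnet_trace.txt dotnet help
#   only_in_working dotnet_trace_01.txt dotnet_trace_02.txt
```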

@lbussell
Contributor

@lauxjpn the behavior for the 6.0, 7.0, and 8.0 Docker images is unexpected. Can you please post an issue for that in https://github.com/dotnet/dotnet-docker?

@janvorli
Member

I have done some initial investigation and found that the issue is related to the new exception handling that was enabled by default in preview 3. An exception occurs on a secondary thread during process shutdown, and the function that the new exception handling uses to iterate over stack frames keeps returning the same frame over and over. The frame is an explicit frame (InlinedCallFrame) whose internal "next" pointer points to itself. I don't have any idea how that could happen at the moment, but it causes the infinite loop.
Until the issue is fully understood and fixed, it can be mitigated by enabling the legacy exception handling by setting the env variable DOTNET_LegacyExceptionHandling=1.
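
A small sketch of applying that mitigation to a single command rather than the whole session. `with_legacy_eh` is a hypothetical wrapper name; only the environment variable itself comes from the comment above.

```shell
#!/bin/sh
# with_legacy_eh CMD...: run CMD with the legacy exception handling switch
# set for that one process only, leaving the calling shell untouched.
with_legacy_eh() {
    DOTNET_LegacyExceptionHandling=1 "$@"
}

# Example, commented out because it needs the SDK:
#   with_legacy_eh dotnet help
```

In a Dockerfile the same effect should be achievable by adding `DOTNET_LegacyExceptionHandling=1` to the `ENV` block ahead of the first `RUN dotnet ...` line.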

janvorli added a commit to janvorli/runtime that referenced this issue Apr 16, 2024
There is an edge case during exception handling on arm32 where an active
InlinedCallFrame is not popped from the explicit frame list. That later
leads to various kinds of failures / crashes. For example, on Alpine
arm32, `dotnet help` hangs, consuming 100% of one CPU core. That happens
due to code executing after the exception was handled and its stack
overwriting the explicit frame contents.

This can only occur when the pinvoke is inlined in a method that calls it
inside of a try region with a catch in the same method, and an exception
occurs, e.g. due to the target native function or the shared library not
existing.

What happens is that when we pop the explicit frame, we pop frames that
are below the SP of the resume location after catch. But the
InlinedCallFrame is in this case above that SP, as it was created in the
prolog of the method.

To fix that, we need to pop that frame too. The fix uses the same
condition as the old EH was using.

Closes dotnet#100536
@janvorli janvorli added area-ExceptionHandling-coreclr in-pr There is an active PR which will close this issue when it is merged and removed area-VM-coreclr untriaged New issue has not been triaged by the area owner labels Apr 17, 2024
@jkotas jkotas closed this as completed in cb978a5 Apr 20, 2024
@github-project-automation github-project-automation bot moved this from Tracking to Done in .NET Docker Apr 20, 2024
matouskozak pushed a commit to matouskozak/runtime that referenced this issue Apr 30, 2024
* Fix missing explicit frame pop on arm32

* Remove forcing crossgen and filtering by target arch for the test

* Reflect PR feedback

---------

Co-authored-by: Jan Vorlicek <jan.vorlicek@volny,cz>
@github-actions github-actions bot locked and limited conversation to collaborators May 21, 2024
Ruihan-Yin pushed a commit to Ruihan-Yin/runtime that referenced this issue May 30, 2024