All node processes hang in uninterruptible deep sleep on linux #55587
Comments
I can't reproduce. It's possible this is machine-specific or an issue with npm. Does running normal Node.js scripts work? |
I have the same problem on NixOS 24.05 (Linux 6.6.58 #1-NixOS SMP PREEMPT_DYNAMIC Tue Oct 22 13:46:36 UTC 2024 x86_64 GNU/Linux) with node version 20.15.1. I can sometimes reproduce the issue with node version 18.19.1 as well, so this is probably a Linux kernel/driver issue. Maybe the culprit is Linux 6.6.58. |
It seems to work on Linux 6.6.53. |
Unkillable processes are, by definition, operating system bugs. That's out of node's control, there's nothing we can do. FWIW, I can tell from the epoll_pwait calls that it's suspending for ~80ms every other call (probably a setInterval timer), so it is making progress. |
There are ways to trigger uninterruptible sleep states from a user space process, instances where the kernel is not misbehaving. Furthermore, the issue only occurs with nodejs. I am not saying that nodejs is at fault here, though. It might just as well be a kernel issue. The bun alternative to node has no such issues. pnpm and yarn have the same problems. Now, knowing that @vherrmann has circumvented the issue by using another kernel, things seem to point to the kernel. However, kernel 6.6 is an LTS kernel. Issues like these would've popped up by the dozens, I would think. Is any further debugging advisable? |
I'd report this to your kernel vendor (nix?), maybe they're floating a bad patch. |
I can spin up a VM and see. Maybe I'll come out w/ a kernel patch or something |
I have the same issue with NixOS here:
The project running is a vite + SvelteKit dev server, source available here: https://github.com/jwpconsulting/projectify . The frontend/ sub directory has a flake available and the server can be run with |
on on For me,
however the
Not sure when I'll have time to bisect this; for now I have rebooted into the earlier NixOS that doesn't exhibit the problem so I can get back to work. :) |
I observed the same problem on Gentoo Linux with a custom-built kernel 6.6.58. After updating to kernel 6.6.59, the problem disappeared for me. But please take this with a grain of salt, since I didn't investigate the low-level root cause any further. However, an uninterruptible suspend state of the node process does suggest a problem at the kernel level. |
Maybe we're looking at io_uring bug #umpteen here. Do the hangs go away when you |
Seems to be a kernel bug. I had the same error on kernel version 6.6.58, and it works fine now that I've updated to kernel version 6.11.5. |
It just happened again, but now on kernel version 6.6.59. Unfortunately it only happens sporadically, so I cannot tell for sure whether the parameter above is helping or not. Will try though. |
Is there something actionable from the Node.js-side? Except for disabling io_uring? |
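For anyone who wants to try the "disable io_uring" route from a script instead of the shell, here is a minimal sketch. It assumes a Node.js build whose libuv honours the UV_USE_IO_URING environment variable mentioned in this thread; the helper name runWithoutIoUring is made up for illustration and is not a Node.js API.

```javascript
const { spawnSync } = require("node:child_process");

// Hypothetical helper: run a Node.js child process with UV_USE_IO_URING=0 so
// that libuv in the child does not use io_uring. Assumption: the libuv version
// bundled with the Node.js binary reads this variable.
function runWithoutIoUring(args, baseEnv = process.env) {
  // Copy the parent environment and override only the one variable.
  const env = { ...baseEnv, UV_USE_IO_URING: "0" };
  // spawnSync blocks until the child exits; for a long-running dev server,
  // child_process.spawn with the same `env` option is the better fit.
  return spawnSync(process.execPath, args, { env, encoding: "utf8" });
}
```

Calling `runWithoutIoUring(["script.js"])` (with script.js as a placeholder) runs the script with io_uring disabled in the child only, which makes it easy to compare against a run without the override.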
Seeing this as well. Wonder if there is anything in common as well with the filesystem folks are using (I have ZFS on NixOS). |
I am using BTRFS, so I don't think it's correlated to the filesystem. |
Also experiencing this issue on XFS since 2024-10-25 (maybe because I rebooted that day) on 6.6.58. Been working around this issue with |
Happens to me too. I can reproduce the crash consistently by running a vite dev server on any project. After a while it just crashes. Setting UV_USE_IO_URING=0 doesn't change anything on the new kernel.
EDIT: I believe that the issue lies in the kernel and/or vite and/or v8. Can anyone point to where we should report this bug? This also happens with bun and node 18.20.1.
$ node -v
v20.15.1
Before kernel upgrade:
$ uname -a
Linux nixos 6.6.57 #1-NixOS SMP PREEMPT_DYNAMIC Thu Oct 17 13:24:38 UTC 2024 x86_64 GNU/Linux
After upgrade:
$ uname -a
Linux nixos 6.11.4 #1-NixOS SMP PREEMPT_DYNAMIC Thu Oct 17 13:27:02 UTC 2024 x86_64 GNU/Linux
npm run dev (vite) (Kernel v6.11.4)
journalctl -xe (Kernel v6.11.4)
Using the old kernel, the process goes into uninterruptible sleep.
journalctl -xe (Kernel 6.6.57):
npm run dev (vite) (Kernel 6.6.57):
|
I have identified the likely problematic commit as gregkh/linux@f4ce3b5. Reverting that atop 6.6.59 resolves the issue. As this seems to be a Kernel bug, I have reported this to the Kernel io_uring mailing list, which hopefully will receive some feedback. |
If I'm not mistaken, the kernel issue only covers the process going into uninterruptible sleep. Why the io_uring ring overflows and the vite process crashes is not yet understood, I believe. |
Theoretically we're dealing with the CQ overflowing https://github.com/libuv/libuv/blob/v1.x/src/unix/linux.c#L1220-L1237, but we could be doing something wrong. Can somebody provide exact instructions to reproduce the issue? |
Since I am affected by this issue and can reliably reproduce it with my particular SvelteKit dev env, I will try to narrow it down to a minimal snippet, preferably in a VM (NixOS on QEMU with all the deps from my dev env) |
Hey, if this is a Kernel bug, as has been mentioned prior, can this be closed, or is there still discussion relevant to Node.js? |
The io_uring Kernel maintainers identified a missing backport (gregkh/linux@8d09a88), and have requested that be backported. I have confirmed it does resolve the issue on Kernel 6.6, at least for my reproducer. While I do believe this is a Kernel bug, the io_uring maintainers had this to say as well:
So it’s possible that there are improvements to be made here as well, but I don’t have much context. I do have a reproducer in a Nix-based declarative VM (which I used to triage Kernel changes and identify the problematic commit), I can post it later if it’s wanted by folks to address any Node.js concerns as suggested above. |
I'm ok closing this in favor of libuv/libuv#4598. Please, if possible, post a reproducer there so we can take a look. Thanks! |
@amarshall, @justuswilhelm and all those affected: do you happen to use devenv?
server: {
  watch: {
    followSymlinks: false
  }
},
I found that the folder .devenv is included in the list of folders that vite watches. Inside my .devenv/profile/include I found that ncurses has symlinks to itself, so chokidar tries to recursively set watchers on the same files by following different paths:
To the nodejs maintainers: sorry for wasting your time. At least we caught a kernel bug :) |
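To check whether a project tree contains the kind of self-referential symlink described above, something along these lines might help. It is only a sketch: findSymlinkLoops is a hypothetical helper, not part of vite or chokidar, and it only flags symlinks that resolve to the directory containing them or to one of its ancestors.

```javascript
const fs = require("node:fs");
const path = require("node:path");

// Hypothetical helper: walk `root` without following symlinks and report every
// symlink whose resolved target is the directory that contains it, or an
// ancestor of that directory. A watcher that follows such a link can reach the
// same files again through an ever longer path.
function findSymlinkLoops(root) {
  const loops = [];
  // Start from the real path so every directory on the stack is a real path.
  const stack = [fs.realpathSync(root)];
  while (stack.length > 0) {
    const dir = stack.pop();
    for (const entry of fs.readdirSync(dir, { withFileTypes: true })) {
      const full = path.join(dir, entry.name);
      if (entry.isSymbolicLink()) {
        let target;
        try {
          target = fs.realpathSync(full);
        } catch {
          continue; // dangling or unresolvable link: nothing to descend into
        }
        const prefix = target.endsWith(path.sep) ? target : target + path.sep;
        if (dir === target || dir.startsWith(prefix)) {
          loops.push(full);
        }
      } else if (entry.isDirectory()) {
        stack.push(full); // real directory: keep walking
      }
    }
  }
  return loops;
}
```

Running `findSymlinkLoops(".devenv")` from the project root would list candidates to exclude from the watcher. An empty result does not prove that the watcher is safe, since other link layouts can also make the same files reachable via several paths.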
I use direnv (and have used devenv) in the same project, but in a different folder from where I'm using nodejs. |
Thank you. I've posted a reproducing example there:
|
FYI, Kernel 6.6.60 has been released, and includes the backported fix for this issue. @hjeldin No, I had this occur when performing |
I would like to mention that the fix has also been backported to 6.1.116. |
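The two fixed versions reported above (6.6.60 and 6.1.116) can be turned into a quick check. This is a rough sketch: hasBackportedFix is a made-up helper, the version table comes only from this thread, and a `false` result does not prove the machine is affected (6.6.53 was reported to work, and distro kernels may carry their own patches).

```javascript
const os = require("node:os");

// First releases of each stable series that, according to this thread,
// include the backported fix. Other series are not covered here.
const FIXED_IN = new Map([
  ["6.6", 60],
  ["6.1", 116],
]);

// Hypothetical helper: returns true or false for the series listed above and
// null for any other kernel series or an unparsable release string.
function hasBackportedFix(release = os.release()) {
  const match = /^(\d+)\.(\d+)\.(\d+)/.exec(release);
  if (!match) return null;
  const series = `${match[1]}.${match[2]}`;
  if (!FIXED_IN.has(series)) return null;
  return Number(match[3]) >= FIXED_IN.get(series);
}
```

Called without arguments it inspects the running kernel via `os.release()`; passing a string such as `"6.6.58"` checks an arbitrary version.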
I am currently encountering a similar issue. #8735 (comment) Could you give me some ideas on how to continue tracking the root cause? |
Attach with gdb, then |
@bnoordhuis thank you for your reply. There was another
System: |
Version
v20.17.0
Platform
Subsystem
No response
What steps will reproduce the bug?
Serving or installing any previously working node project results in the node process hanging uninterruptibly.
The SIGKILL signal is not handled; it seems the syscall handler of syscall 281 (epoll_pwait) goes into deep sleep and, as a kernel thread, suspends signal handling until completion.
However, there are also other syscalls still happening.
Furthermore, it seems to do some more work. Here is the output of running:
How often does it reproduce? Is there a required condition?
Unconditionally reproducible in several node projects, even in simple official vite templates.
What is the expected behavior? Why is that the expected behavior?
The process completes the installation of the node dependencies in a reasonable time, all the while displaying status information.
If the process needs to be killed, sending the SIGTERM or SIGKILL signal should kill it in a reasonable time.
What do you see instead?
Instead, the process hangs indefinitely.
Furthermore, killing the process is impossible.
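One way to confirm that a hung process really is in uninterruptible sleep is to read its state letter from /proc/<pid>/stat, where `D` means uninterruptible sleep. A minimal sketch follows; the helper names parseProcState and isUninterruptible are made up for illustration, and the second one works on Linux only.

```javascript
const fs = require("node:fs");

// Extract the state letter from the contents of /proc/<pid>/stat.
// The command name (second field) is wrapped in parentheses and may itself
// contain spaces or parentheses, so split after the LAST ")".
function parseProcState(statContents) {
  const end = statContents.lastIndexOf(")");
  if (end === -1) return null;
  const state = statContents.slice(end + 1).trim().split(/\s+/)[0];
  return state || null;
}

// Linux only: true if the process is currently in state "D".
function isUninterruptible(pid) {
  const contents = fs.readFileSync(`/proc/${pid}/stat`, "utf8");
  return parseProcState(contents) === "D";
}
```

The state is a snapshot, so a process that flips between states may need to be sampled several times before `D` shows up consistently.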
Additional information
No response