Kernel 6.6.57+ io_uring stall ("yarn install takes indefinitely") #353709

Closed
gador opened this issue Nov 4, 2024 · 41 comments
Labels
0.kind: bug Something is broken 0.kind: regression Something that worked before working no longer 1.severity: blocker This is preventing another PR or issue from being completed 6.topic: kernel The Linux kernel 6.topic: nodejs

Comments

@gador
Member

gador commented Nov 4, 2024

Update:

The root cause is a Linux kernel regression on the build system. The kernels tested are:

  • 6.6.56: succeeds
  • 6.6.57: fails
  • 6.6.58: fails
  • 6.6.59: fails
  • 6.6.59 (with f4ce3b5 reverted): succeeds
  • 6.11.6: succeeds

The first failing version, 6.6.57, landed in nixpkgs in 0e4c64f.

This kernel regression causes npm and yarn to hang and become unkillable.

It has nothing to do with Nix sandboxing, as I first suspected.
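
A quick way to tell whether a machine falls in the failing range listed above is to compare its kernel release against 6.6.57 to 6.6.59. The following is a minimal POSIX shell sketch; the function name is_affected_kernel is made up for illustration, and it only encodes the 6.6.x versions tested in this issue, not any other stable series.

```shell
#!/bin/sh
# Hypothetical helper: succeeds (exit 0) if the given kernel release is one of
# the 6.6.x versions reported as failing in this issue (6.6.57 to 6.6.59).
is_affected_kernel() {
  rel=${1%%-*}                      # drop any suffix such as "-custom"
  case $rel in
    6.6.*) patch=${rel#6.6.} ;;     # keep only the patch level
    *) return 1 ;;                  # other series are not covered here
  esac
  case $patch in
    ''|*[!0-9]*) return 1 ;;        # not a plain number: treat as unaffected
  esac
  [ "$patch" -ge 57 ] && [ "$patch" -le 59 ]
}

running=$(uname -r)
if is_affected_kernel "$running"; then
  echo "kernel $running is in the failing 6.6.57-6.6.59 range"
else
  echo "kernel $running is outside the failing 6.6.57-6.6.59 range"
fi
```

The check is deliberately narrow: it says nothing about whether a kernel outside that range is otherwise fine.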


Describe the bug

Currently yarn install hangs at the step linking dependencies...

Steps To Reproduce

Steps to reproduce the behavior:
1. Try to build pgadmin4 on master
2. Wait for linking dependencies...
3. ...

or just run nix build github:nixos/nixpkgs/71e91c409d1e654808b2621f28a327acfdad8dc2#pgadmin --rebuild

Expected behavior

yarn install should continue with the install process

Additional context

I noticed this issue after an unrelated small bugfix in pgadmin4 (#353092) triggered a rebuild, which failed. Ofborg built it just fine, which is why I merged the fix, but the package never built on my system. It does not currently build on hydra either (see e.g. https://hydra.nixos.org/build/277185860/nixlog/1).

I'm not sure what changed, since nothing substantially changed on the package. I've also tried to re-run the update script which resulted in exactly the same yarn.lock.

Running strace or lsof did not result in any trace of the issue.

Also, interestingly, running --check on an older nixos-unstable pgadmin4 derivation fails to build at the same step.

Is there anything in the nix builder that changed sandbox or build behavior and could stall yarn?
I've looked at NixOS/nix#10312, which changed things related to the sandbox, and found an old unpatched nix version in 24.05 (nix 2.18.2, which according to GHSA-q82p-44mg-mgh5 hasn't been fixed yet), and it compiles the current pgadmin4 just fine!

This does not work with a patched nix version (it doesn't matter whether it's 2.18.4 or newer).

So the patch to fix the build-dir seems to have broken at least pgadmin.

Notify maintainers

@roberth


@gador gador added the 0.kind: bug Something is broken label Nov 4, 2024
@FliegendeWurst
Member

I've had this issue too. It doesn't just hang, it goes into disk sleep, meaning you can't kill it, not even by shutting down the system.

@gador
Member Author

gador commented Nov 4, 2024

Yes! Not even sudo kill -9 $PID helps. Only restarting the whole system works. I'm trying to dissect where it actually goes wrong, but I believe it has something to do with the new chroot safety feature from NixOS/nix@0e4baff
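
Processes that ignore SIGKILL like this are typically in uninterruptible sleep, which ps reports with a state starting with D (the "disk sleep" mentioned above). A small sketch for listing them, assuming a procps-style ps that accepts -eo pid=,stat=,comm=; the helper name filter_uninterruptible is hypothetical:

```shell
#!/bin/sh
# Hypothetical helper: reads "PID STATE COMMAND" lines on stdin and prints
# only those whose state field starts with D (uninterruptible sleep).
filter_uninterruptible() {
  awk '$2 ~ /^D/'
}

# List every process currently in D state (assumes a procps-style ps).
ps -eo pid=,stat=,comm= | filter_uninterruptible
```

Seeing node, npm, yarn or pnpm in that output for minutes at a time would match the symptoms described in this thread; a process that shows up only briefly is normal I/O wait.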

@gador
Member Author

gador commented Nov 4, 2024

I confirmed my suspicion.
I have the following diff on the current 2.24.10 version

diff --git a/src/libstore/unix/build/local-derivation-goal.cc b/src/libstore/unix/build/local-derivation-goal.cc
index 2a09e3dd4..baeae54f8 100644
--- a/src/libstore/unix/build/local-derivation-goal.cc
+++ b/src/libstore/unix/build/local-derivation-goal.cc
@@ -509,11 +509,11 @@ void LocalDerivationGoal::startBuilder()
     /* Create a temporary directory where the build will take
        place. */
     topTmpDir = createTempDir(settings.buildDir.get().value_or(""), "nix-build-" + std::string(drvPath.name()), false, false, 0700);
-#if __APPLE__
+//#if __APPLE__
     if (false) {
-#else
-    if (useChroot) {
-#endif
+//#else
+//    if (useChroot) {
+//#endif
         /* If sandboxing is enabled, put the actual TMPDIR underneath
            an inaccessible root-owned directory, to prevent outside
            access.

which basically reverts NixOS/nix@0e4baff. I used this as nix.package in a VM to test the build. I then ran nix build github:nixos/nixpkgs/nixos-unstable#pgadmin4 --rebuild -L and it worked!

Doing this on any newer nix version without the above diff fails. So this is exactly the reason. yarn (for whatever reason) either does not like the subdirectory /build (which is unlikely) or the permission 700.

Not sure how to tackle this problem, though. It is unlikely that pgadmin is the only victim here. And that you have to restart the whole system to kill a bunch of node yarn install ... processes isn't cool either.

@thufschmitt any idea here? Also, in light of ZHF #352882 a bit of a pressing problem

@roberth
Member

roberth commented Nov 4, 2024

I haven't seen this before. I'm not much of a darwin expert, but here's my thoughts. (not a darwin issue!)

The directory names got longer, and unix sockets have a very restricted length on darwin. Some software does not expect a long(er) TMPDIR and may not handle that correctly, leading to undefined/strange behavior.

Although strace didn't reveal much, it might be worth comparing a hanging run to a successful run, especially if the execution is deterministic, which makes a semi-automated comparison much easier.

Is each node in this chain of directories that makes up TMPDIR readable (+rx) by the sandboxed build process? If not, would it be ok to make it readable only by the build user? This is slightly less secure, but might be ok.

This could probably be fixed on either side, Nix or yarn. Could you open an issue on the https://github.com/NixOS/nix repo for the regression? It'd help to get more eyes on this. (I'd move the issue if it was clearly one or the other, fwiw)

Another practical note: @thufschmitt has changed jobs and isn't contributing actively to the Nix/NixOS ecosystem anymore.

@FliegendeWurst
Member

I haven't seen this before. I'm not much of a darwin expert, but here's my thoughts.

I have the same issue on Linux. There is nothing really suspicious in lsof either.

pnpm    191060 nixbld1 cwd       DIR   0,36       40    407492 /build/source (deleted)
pnpm    191060 nixbld1 rtd       DIR  259,2     4096  41432658 /
... lots of /nix/store paths, anon_inode io_uring, pipes ...
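
The io_uring entries in that listing can also be checked without lsof by reading the symlinks under /proc/PID/fd, where io_uring instances show up as anon_inode:[io_uring]. A rough sketch, with count_io_uring_fds as a made-up helper name; inspecting another user's process (such as one owned by nixbld1) generally needs root:

```shell
#!/bin/sh
# Hypothetical helper: prints how many open file descriptors of the given PID
# point at an io_uring instance, judged by the /proc/PID/fd symlink targets.
count_io_uring_fds() {
  count=0
  for fd in "/proc/$1/fd"/*; do
    case $(readlink "$fd" 2>/dev/null) in
      *io_uring*) count=$((count + 1)) ;;
    esac
  done
  echo "$count"
}

# Default to the current shell when no PID is passed on the command line.
count_io_uring_fds "${1:-$$}"
```

A non-zero count only shows that the process uses io_uring; it does not by itself prove that the process is stuck.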

@Shawn8901
Contributor

As I don't see it explicitly named: I think it is definitely not yarn only (pnpm is shown in the previous comment). I observed a similar issue when building stalwart-mail.webadmin while trying to reproduce a recent build failure.
That package uses npm (same symptoms: never finished, 0 activity, can't kill -9, shutdown blocked).

I was running a nixos-unstable that was maybe 1-2 weeks old.
Let me know if I should try to reproduce and gather some info.

@gador
Member Author

gador commented Nov 5, 2024

@roberth thanks for chiming in. This is not a darwin issue: it is only present when the code is executed on a non-Apple system.
I can build pgadmin just fine on 2.24.9 on aarch64-darwin. My "patch" above just disables the chroot condition for all systems.

Also, even worse, when trying to build pgadmin on linux: because the process is unkillable, the system will neither reboot nor shut down! It hangs forever on a watchdog issue and has to be powered down by hand. This can be a huge issue for bare-metal servers.

@gador
Member Author

gador commented Nov 5, 2024

Is each node in this chain of directories that makes up TMPDIR readable (+rx) by the sandboxed build process?

AFAIS, yes.

ls -la /tmp
drwx------  3 root  root     3 Nov  5 06:21 nix-build-pgadmin-8.11.drv-1
sudo ls -la /tmp/nix-build-pgadmin-8.11.drv-1
drwx------  5 nixbld1 nixbld  7 Nov  5 06:21 build
sudo ls -la /tmp/nix-build-pgadmin-8.11.drv-1/build
total 64
drwx------ 5 nixbld1 nixbld     7 Nov  5 06:21 .
drwx------ 3 root    root       3 Nov  5 06:21 ..
drwxr-xr-x 3 nixbld1 nixbld     3 Nov  5 06:21 .cache
-rw------- 1 nixbld1 nixbld 35469 Nov  5 06:21 env-vars
drwxr-xr-x 9 nixbld1 nixbld    19 Nov  5 06:21 source
drwxr-xr-x 3 nixbld1 nixbld     3 Nov  5 06:21 v8-compile-cache-1000
-rw-r--r-- 1 nixbld1 nixbld   160 Nov  5 06:21 .yarnrc
sudo ls -la /tmp/nix-build-pgadmin-8.11.drv-1/build/.cache/yarn/v6
[...]
drwxr-xr-x    3 nixbld1 nixbld    3 Nov  5 06:21 npm-yarn-audit-html-4.0.0-dc04c9cf83e758fd6d9efad8c96df1fc8c4bf30c
drwxr-xr-x    3 nixbld1 nixbld    3 Nov  5 06:21 npm-yauzl-2.10.0-c7eb17c93e112cb1086fa6d8e51fb0667b79a5f9
drwxr-xr-x    3 nixbld1 nixbld    3 Nov  5 06:21 npm-yocto-queue-0.1.0-0294eb3dee05028d31ee1a5fa2c556a6aaf10a1b
drwxr-xr-x    3 nixbld1 nixbld    3 Nov  5 06:21 npm-yocto-queue-1.1.1-fef65ce3ac9f8a32ceac5a634f74e17e5b232110
drwxr-xr-x    3 nixbld1 nixbld    3 Nov  5 06:21 npm-zustand-4.5.4-63abdd81edfb190bc61e0bbae045cc4d52158a05
drwxr-xr-x    2 nixbld1 nixbld    2 Nov  5 06:21 .tmp

This could probably be fixed on either side, Nix or yarn. Could you open an issue on the https://github.com/NixOS/nix repo for the regression?

done

@blurgyy

blurgyy commented Nov 5, 2024

Also seeing this on a x86-64 linux machine running hydra, the command npm ci runs forever and kill -9 does nothing.

@datafoo
Contributor

datafoo commented Nov 5, 2024

Same here on my x86-64 linux development VM. I did a nixos-rebuild switch --upgrade yesterday and since then the problem happens with npm ci and npm install.

@gador
Member Author

gador commented Nov 5, 2024

@datafoo when was your last known good commit?

@gador
Member Author

gador commented Nov 5, 2024

I investigated further and I narrowed it down to something between these commits:

broken 4c2fcb0 2024-10-18 1809433
good a3c0b3b 2024-10-14 1809364

Tested as the input for a NixOS VM with a fixed nix.package = pkgs.nixVersions.nix_2_24; and always trying to build nix build -L --rebuild github:nixos/nixpkgs/nixos-unstable#pgadmin4
With the broken commit, this stalls. With the good commit this continues on and builds. Since the derivation to build is fixed (and so are all the inputs e.g. yarn or node), this obviously has something to do with the build environment. And this changed between those commits.

I haven't found an easy culprit with git diff yet.

@Garmelon
Contributor

Garmelon commented Nov 5, 2024

On my system, manually (as in: typing it into my terminal) running npm ci in a repo also hangs the npm ci process. The build is not running through nix. The process is un-sigkill-able.

My system is running on nixpkgs commit 807e915.

Edit: Steps to reproduce (at least on my machine):

  1. cd into a project that uses npm. (I don't yet know if this works on all repos or only more complicated ones.)
  2. Run rm -r node_modules
  3. Run npm ci. Note that this time, it completes and exits successfully, as expected.
  4. Run npm ci immediately afterwards (may be time sensitive). Note that it appears to hang, the little spinner spinning indefinitely, without any other output.
  5. Press Ctrl+C. Note that the npm ci process still exists, but now in its un-SIGKILL-able state. Since the process still exists, you are not dumped back in your shell prompt either.

I kept running npm ci in different ways (but in the same repository). Roughly every second npm ci call seemed to get stuck. These patterns seemed to hold most of the time:

  1. After a successful run of npm ci, an immediate rerun seems to get stuck.
  2. After a stuck run of npm ci, an immediate rerun seems to succeed.
  3. After a successful run of npm ci, a rerun after a wait of a minute or so seems to succeed or get stuck randomly.
  4. After a stuck run of npm ci, a rerun after a wait of a minute or so seems to succeed.
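
The manual steps above can be turned into a small watchdog so that a stuck run is detected without blocking the terminal. This is only a sketch: run_with_deadline is a hypothetical helper, the deadline is an arbitrary choice, and on an affected kernel the final kill -9 is expected to have no effect.

```shell
#!/bin/sh
# Hypothetical helper: run_with_deadline SECONDS COMMAND [ARGS...]
# Starts the command in the background and polls once per second. It never
# blocks in `wait`, so a child stuck in uninterruptible sleep cannot hang
# this script as well.
run_with_deadline() {
  deadline=$1
  shift
  "$@" >/dev/null 2>&1 &
  pid=$!
  elapsed=0
  while [ "$elapsed" -lt "$deadline" ]; do
    if ! kill -0 "$pid" 2>/dev/null; then
      echo "finished"
      return 0
    fi
    sleep 1
    elapsed=$((elapsed + 1))
  done
  if kill -0 "$pid" 2>/dev/null; then
    echo "stuck (pid $pid)"
    kill -9 "$pid" 2>/dev/null    # best effort; may do nothing on an affected kernel
    return 1
  fi
  echo "finished"
}

# Example (left commented out so nothing is run by accident):
# run_with_deadline 120 npm ci
```

Calling it twice in a row from a project directory, after rm -r node_modules, would mirror steps 3 and 4; a "stuck" result on the second call matches the pattern described above.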

@donovanglover
Member

Bun and Deno seem to not be affected.

@donovanglover donovanglover added 1.severity: blocker This is preventing another PR or issue from being completed 0.kind: regression Something that worked before working no longer labels Nov 6, 2024
@gador
Member Author

gador commented Nov 6, 2024

@Garmelon I think this is an unrelated bug. What I described here is a bug in a build process from nix, which always uses the same node and yarn version and fails or succeeds depending on the host machine's NixOS version. This is why I suspect nix to be involved.
@donovanglover I cannot rule out a random hang in the build process. But as of now it consistently works or consistently fails depending on the commit of the build machine.

@K900
Contributor

K900 commented Nov 6, 2024

This is a kernel bug, specifically with io_uring.

Edit: source: https://lore.kernel.org/io-uring/2024110620-stretch-custodian-0e7d@gregkh/T/#u

@datafoo
Contributor

datafoo commented Nov 6, 2024

@datafoo when was your last known good commit?

The last generation that works for me is 310:

$ nixos-version 
24.05.5562.1bfbbbe5bbf8 (Uakari)

$ nix --version 
nix (Nix) 2.18.8

$ which nix
/run/current-system/sw/bin/nix

$ l /run/current-system/sw/bin/nix
lrwxrwxrwx 1 root root 62 1970-01-01 01:00 /run/current-system/sw/bin/nix -> /nix/store/x6b4rr799djkf8a2abwf59fadcbyasc1-nix-2.18.8/bin/nix

$ uname --all
Linux redacted 6.6.54 #1-NixOS SMP PREEMPT_DYNAMIC Fri Oct  4 14:30:05 UTC 2024 x86_64 GNU/Linux

The first generation that fails for me is 311:

$ nixos-version
24.05.6122.080166c15633 (Uakari)

$ nix --version 
nix (Nix) 2.18.8

$ which nix
/run/current-system/sw/bin/nix

$ l /run/current-system/sw/bin/nix
lrwxrwxrwx 1 root root 62 1970-01-01 01:00 /run/current-system/sw/bin/nix -> /nix/store/ikj1h47p1msvkg7nbyqxabk14n75pfwj-nix-2.18.8/bin/nix

$ uname --all
Linux redacted 6.6.58 #1-NixOS SMP PREEMPT_DYNAMIC Tue Oct 22 13:46:36 UTC 2024 x86_64 GNU/Linux

Observations:

  • Nix is at version 2.18.8 in both cases but the store paths are different.
  • The kernel versions are different.

@rbozan

rbozan commented Nov 6, 2024

This is a kernel bug, specifically with io_uring.

Edit: source: lore.kernel.org/io-uring/2024110620-stretch-custodian-0e7d@gregkh/T#u

I'm using this as a workaround for now:

boot.kernelPackages = pkgs.linuxPackages_latest;

@gador
Member Author

gador commented Nov 6, 2024

This is a kernel bug, specifically with io_uring.

Edit: source: https://lore.kernel.org/io-uring/2024110620-stretch-custodian-0e7d@gregkh/T/#u

Thanks! I'm bisecting right now (which takes a long time), but since I'm rebuilding the kernel right now, this sounds about right!

@gador
Member Author

gador commented Nov 6, 2024

Bisect is done:

0e4c64ff9cef5800c6a3f4838c66a918ceb61398 is the first bad commit
commit 0e4c64ff9cef5800c6a3f4838c66a918ceb61398
Author: K900 <[email protected]>
Date:   Thu Oct 17 17:12:25 2024 +0300

    linux_6_6: 6.6.56 -> 6.6.57

 pkgs/os-specific/linux/kernel/kernels-org.json | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

So, yes, this is a kernel bug

@gador
Member Author

gador commented Nov 6, 2024

@K900 shall we revert this specific commit?

@K900
Contributor

K900 commented Nov 6, 2024

No, a proper fix should be in the next batch of stable kernels.

@gador
Member Author

gador commented Nov 6, 2024

Ok. I'll keep the issue open until the next batch has landed. For everyone looking for a quick fix, #353709 (comment) seems like a viable option (or reverting, or not updating).

@Tofandel

Tofandel commented Nov 6, 2024

Do you have a /usr/bin/ldd file in NixOS? This is what determines whether yarn uses node's getReport (which is buggy and makes the process hang, and not randomly, it's a reliable bug). If you do, then that bug has nothing to do with it.

@SuperSandro2000
Member

SuperSandro2000 commented Nov 6, 2024

We don't have /usr at all but ldd might be in PATH.

@Tofandel

Tofandel commented Nov 6, 2024

It's hardcoded in yarn berry: https://github.com/yarnpkg/berry/blob/f59bbf9f3828865c14b06a3e5cc3ae284a0db78d/packages/yarnpkg-core/sources/nodeUtils.ts#L26

Btw if you are not using yarn berry (which is 95% of the users of yarn) then getReport is not used at all either

@mweinelt
Member

mweinelt commented Nov 6, 2024

What irks me is that this issue mixes up apparent darwin sandboxing issues and linux kernel regressions into one, because they both block the build in a similar fashion apparently.

@gador
Member Author

gador commented Nov 6, 2024

What irks me is that this issue mixes up apparent darwin sandboxing issues and linux kernel regressions into one, because they both block the build in a similar fashion apparently.

It's not a darwin sandbox issue but a linux one, and the linux kernel regression is the root cause. I closed the nix issues as soon as the root cause was known. If you read the comments it should become apparent, but I can add an "edit" to the original post for clarity.

@Garmelon
Contributor

Garmelon commented Nov 6, 2024

For completeness, there's also nodejs/node#55587

@Eveeifyeve
Contributor

This occurs in a pnpm build in modrinth-app.

@Frontear
Member

Frontear commented Nov 7, 2024

+1 occurs in jellyfin-web, as reported by @SeamusFD. Haven't tested myself if rolling the kernel back/pushing ahead fixes it, but I can expect it probably does given this drv breaks for the same reasons as everything else here.

@mweinelt
Member

mweinelt commented Nov 7, 2024

No reason to report back further until you run at least Kernel 6.6.60 or 6.12.5

@Garmelon
Contributor

Garmelon commented Nov 7, 2024

6.12.5

As far as I understand, the bug is not present in non-lts kernels, i.e. 6.11 or later.

It should hopefully be fixed in lts kernels 6.1.116 and 6.6.60.

@mweinelt mweinelt changed the title yarn install takes indefinitely Kernel 6.6.57+ io_uring stall ("yarn install takes indefinitely") Nov 7, 2024
@donovanglover donovanglover added the 6.topic: kernel The Linux kernel label Nov 7, 2024
@dkumza

dkumza commented Nov 8, 2024

6.12.5

As far as I understand, the bug is not present in non-lts kernels, i.e. 6.11 or later.

It should hopefully be fixed in lts kernels 6.1.116 and 6.6.60.

Thanks, I added the latest kernel to my config and I have no issues!
boot.kernelPackages = pkgs.linuxPackages_latest;

@gador
Member Author

gador commented Nov 8, 2024

A comment on this workaround: This only works for non-ZFS configs. The ZFS configs (at least the stable ones) rely on a kernel <= 6.11, so the only workaround here is a revert/rollback until 6.6.60 is released.

@Ramblurr
Contributor

Ramblurr commented Nov 8, 2024

A comment on this workaround: This only works for non ZFS configs. The zfs config (at least the stable ones) rely on a kernel <= 6.11

Wow thanks for this, I was just fighting with my config trying to make this work.

I see zfs_unstable supports 6.11 now, but when I tried to enable that on a test machine:

  boot.kernelPackages = lib.mkForce pkgs.linuxPackages_6_11;
  boot.zfs.package = lib.mkForce pkgs.zfs_unstable;

# output:
error: Package ‘zfs-kernel-2.2.6-6.11.5’ in /nix/store/wb6agba4kfsxpbnb5hzlq58vkjzvbsk6-source/pkgs/os-specific/linux/zfs/generic.nix:216 is marked as broken, refusing to evaluate.

I'm not sure why it's pulling in 2.2.6 when zfs_unstable is 2.3, but anyway that is probably off-topic here. And not all systems can use zfs unstable.

.. so the only workaround here is a revert/rollback until 6.6.60 is released

Rolling back to LTS 6.1 isn't an option unfortunately. Do I understand correctly that kernel 6.6.56 is the last known good?

What would be the proper way to revert/rollback to 6.6.56?

@gador
Member Author

gador commented Nov 8, 2024

The way I did it was to selectively revert the update commits on my own nixpkgs fork:

commit 614327d436779e271ad599da3673dd9a2a8df873
Author: Florian Brandes <[email protected]>
Date:   Wed Nov 6 14:08:02 2024 +0100

    Revert "linux_6_6: 6.6.56 -> 6.6.57"

    This reverts commit 0e4c64ff9cef5800c6a3f4838c66a918ceb61398.

commit 4b7bc628a6d0bba9a94748d0acae013b5ee8dfff
Author: Florian Brandes <[email protected]>
Date:   Wed Nov 6 14:07:53 2024 +0100

    Revert "linux_6_6: 6.6.57 -> 6.6.58"

    This reverts commit c0b43de177e6cc1ac03f6b6bab64b8f4aecce996.

commit 0fa02103f6b272d5c7c8d1be7bff70c8aa52b1f8
Author: Florian Brandes <[email protected]>
Date:   Wed Nov 6 14:05:28 2024 +0100

    Revert "linux_6_6: 6.6.58 -> 6.6.59"

    This reverts commit 05b2afe1c8e9b43d05ca9f4b6bf6dba1222bc99a.

And of course you can revert your NixOS config to an older generation which worked and wait for the kernel update

@dbaynard
Contributor

dbaynard commented Nov 8, 2024

This is a kernel bug, specifically with io_uring.
Edit: source: lore.kernel.org/io-uring/2024110620-stretch-custodian-0e7d@gregkh/T#u

I'm using this as a workaround for now:

boot.kernelPackages = pkgs.linuxPackages_latest;

There's no ZFS in the current pkgs.linuxPackages_latest so pinning to a nixpkgs with 6.6.56 will have to do, for me, until 6.6.60 is out. [edit: I see I missed the discussion of this — serves me right for not re-reading latest messages before posting. Try 5633bcf from October 9th.]

In my case (laptop, 6.6.59) shutdown is fine but delayed, (deep) sleep does not appear to resume.

Thanks, everyone.

@Eveeifyeve
Contributor

This occurs in a pnpm build in modrinth-app.

As said before this is happening with pnpm not just yarn.

@K900
Contributor

K900 commented Nov 9, 2024

Kernel 6.6.60 is merged now with a fix.

@K900 K900 closed this as completed Nov 9, 2024
sodiboo added a commit to sodiboo/system that referenced this issue Nov 22, 2024
critical security fix for Sharkey; but first kernel must be upgraded
because of NixOS/nixpkgs#353709 which caused
me pain for WEEKS. i had pnpm running for dozens of days! at one point,
four instances at once! awful pain. thankfully fixed now i could just
upgrade linux lol