Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

process.report.getReport(); is potentially very slow #46060

Closed
adesso-os opened this issue Jan 2, 2023 · 16 comments · Fixed by #51645
Closed

process.report.getReport(); is potentially very slow #46060

adesso-os opened this issue Jan 2, 2023 · 16 comments · Fixed by #51645
Labels
report Issues and PRs related to process.report.

Comments

@adesso-os
Copy link

adesso-os commented Jan 2, 2023

Version

v18.12.1

Platform

Linux DE-ADN-2GPTTQ3 5.15.79.1-microsoft-standard-WSL2 #1 SMP Wed Nov 23 01:01:46 UTC 2022 x86_64 GNU/Linux

Subsystem

report

What steps will reproduce the bug?

Calling process.report.getReport(); is all it takes. The call will not return for longer than 1 minute, if the failure conditions are met.

How often does it reproduce? Is there a required condition?

The issue is 100% reproducible within scenarios I can only construct with our private code bases so far.

I am seeing this as part of the package installation process of yarn, which call process.report.getReport(); internally. This usually takes less than a second, but in certain scenarios, it takes more than a minute.

What is the expected behavior?

The call should return reliably within a timely fashion, not consuming more than 2-3 seconds.

What do you see instead?

Node hangs for more than a minute and is completely unresponsive.

Additional information

While looking into this issue, I found this comment in the source code of yarn:

    // Doing these calls early is important: it seems calling it after we've started the resolution incurs very high
    // performance penalty on WSL, with the Gatsby benchmark jumping from ~28s to ~130s. It's possible this is due to
    // the number of async tasks being listed in the report, although it's strange this doesn't occur on other systems.

These comments are above the code path that leats to the report being generated. This suggests that the behavior is a WSL-specific issue. Assuming that's the case, I'd still like to get better insights into the root cause.

@adesso-os
Copy link
Author

adesso-os commented Jan 2, 2023

From what I can tell so far with my limited understanding of Node internals, the report is delayed by the generation of the libuv section.

To generate this section, all open handles will be inspected and information about them is attached to the report. The handles that cause the majority of the delay are the ones relating to sockets. There seem to be at least 20+ open sockets at the time the report is generated.

static void ReportEndpoint(uv_handle_t* h,
reads like this might try to retrieve the hostname for the IP address on that socket, but I might be misreading that source.

I must assume the DNS queries take especially long in my environment for this to be this problematic, but I also feel like it might not be ideal to run DNS lookups in a fully synchronous code path. I got the impression that generating the report should be fairly quick. So any chance to introduce minute-long delays should likely be avoided.

@oliversalzburg
Copy link
Contributor

This code reproduces the issue reliably on one of my systems:

const https = require("node:https");

const SOCKET_COUNT = 10;
const sockets = [];
let allResolved = false;

const keepAliveAgent = new https.Agent({ keepAlive: true });

const performRequest = () => {
  const socketPromise = new Promise((resolve) => {
    https
      .get(
        {
          agent: keepAliveAgent,
          hostname: "registry.npmjs.org",
        },
        () => {
          process.stdout.write(".");
          resolve();
        }
      )
      .on("error", (error) => {
        console.error(error);
        resolve();
      });
  });
  sockets.push(socketPromise);
};

process.stdout.write(`${new Date().toISOString()} opening ${SOCKET_COUNT} sockets`);
for (let loops = 0; loops < SOCKET_COUNT; ++loops) {
  performRequest();
}

Promise.all(sockets).then(() => {
  allResolved = true;
  console.log(
    `\n${new Date().toISOString()} all sockets created. requesting report...`
  );
  process.report.getReport();

  console.log(`${new Date().toISOString()} destroying client`);
  keepAliveAgent.destroy();

  console.log(`${new Date().toISOString()} end`);
});

const check = () => {
  if (!allResolved) {
    setTimeout(check, 200);
    return;
  }
};

check();

On this system, it takes about 20s to generate the report.

@cjihrig
Copy link
Contributor

cjihrig commented Jan 2, 2023

FWIW, on my machine (M1 MBP), the report is generated in much less than one second.

@oliversalzburg
Copy link
Contributor

FWIW, on my machine (M1 MBP), the report is generated in much less than one second.

Interesting. It was suggested in the yarn source that this is somehow WSL-specific. So far, I'm having a hard time understanding what aspect of WSL would introduce this, but I also haven't compared with non-WSL Linuxes yet.

@oliversalzburg
Copy link
Contributor

oliversalzburg commented Jan 2, 2023

After playing around a bit at https://github.com/oliversalzburg/report-repro/, I can see that RDNS is being performed on the addresses of all open sockets. With a large enough number of sockets, I could also push the report generation over the multi-second mark in GitHub Actions on ubuntu-latest.

I still feel like this is problematic, but the behavior is certainly amplified in WSL. The default behavior of which is to register an internal DNS resolver into /etc/resolv.conf, which seems to perform poorly. I assume the idea here is to allow resolution of names on the local network.

When I inspect the answers from this resolver, the problem also becomes clearer:

$ dig -x 1.1.1.1 @172.22.176.1 +noall +answer
1.1.1.1.in-addr.arpa.   0       IN      PTR     one.one.one.one.

$ dig -x 1.1.1.1 @8.8.8.8 +noall +answer
1.1.1.1.in-addr.arpa.   1006    IN      PTR     one.one.one.one.

The WSL resolver will set the TTL of all answers to 0, which prevents any DNS caching and the resolution will have to be done over and over for all open sockets.

A workaround is to configure a different DNS server for the WSL environment. https://learn.microsoft.com/en-us/windows/wsl/wsl-config#network-settings

@adesso-os
Copy link
Author

I also reported this for WSL: microsoft/WSL#9423

@bnoordhuis
Copy link
Member

@cjihrig I can reproduce (on Linux) when I blackhole outgoing DNS traffic.

It's the uv_getnameinfo() calls that are problematic - every socket adds a 30s delay - but I can see why you'd want the report to have that info. Tough call.

@adesso-os adesso-os changed the title process.report.getReport(); is very slow process.report.getReport(); is potentially very slow Jan 3, 2023
@joyeecheung
Copy link
Member

joyeecheung commented Jan 4, 2023

Perhaps we can add an option to process.report.getReport() to not do DNS queries (or any other operations that rely on stable network configurations)? Or an environment variable that is effective for all the other report triggers?

@arcanis
Copy link
Contributor

arcanis commented Nov 13, 2023

Some users have reported 6.5 minutes of install time when running yarn install due to getReport:

image

@lpic10
Copy link

lpic10 commented Nov 13, 2023

I was facing similar issue when using yarn v2 on a Ubuntu VM with hyper-v, instead of wsl.

Thanks to the code by @oliversalzburg to reproduce the issue I found that the systemd-resolved was the culprit and this thread helped me to find a workaround: systemd/systemd#28166

Maybe this can help someone with similar setup as mine:

  • disable multicast DNS and LLMNR by editing /etc/systemd/resolved.conf as below
MulticastDNS=no
LLMNR=no
  • restart systemd-resolved

systemctl restart systemd-resolved

@Ethan-Arrowood
Copy link
Contributor

Hey folks, I'm going to try fixing this! I'm going to implement @joyeecheung 's idea: #46060 (comment)

@joyeecheung
Copy link
Member

joyeecheung commented Jan 24, 2024

By the way I don't know what exactly yarn is using getReport() for so having an option to omit data that requires network connection may not automatically fix the issue for yarn if yarn doesn't pick it up or if it does need that kind of data.

@joyeecheung
Copy link
Member

It looks like yarn is using it to determine libc variant https://github.com/yarnpkg/berry/blob/6db7b2195bea567a275f3b95221c0517bd925dea/packages/yarnpkg-core/sources/nodeUtils.ts#L29 IMO that's using a sledgehammer to crack a nut, we should probably just provide an API specifically for listing the loaded shared objects.

@oliversalzburg
Copy link
Contributor

@joyeecheung Please note this comment: yarnpkg/berry#5167 (comment), specifically the reference to #41338.

Ethan-Arrowood added a commit to Ethan-Arrowood/node that referenced this issue Feb 2, 2024
New option `--report-disable-network`, also available as `report.disableNetwork`, enables the user to disable networking interfaces in their diagnostic report. On some systems, this can cause the report to take minutes to generate so this option can be used to optimize that.

fixes: nodejs#46060
Ethan-Arrowood added a commit to Ethan-Arrowood/node that referenced this issue Feb 2, 2024
New option `--report-disable-network`, also available as `report.disableNetwork`, enables the user to disable networking interfaces in their diagnostic report. On some systems, this can cause the report to take minutes to generate so this option can be used to optimize that.

fixes: nodejs#46060
Ethan-Arrowood added a commit to Ethan-Arrowood/node that referenced this issue Feb 2, 2024
New option `--report-disable-network`, also available as `report.disableNetwork`, enables the user to disable networking interfaces in their diagnostic report. On some systems, this can cause the report to take minutes to generate so this option can be used to optimize that.

fixes: nodejs#46060
Ethan-Arrowood added a commit to Ethan-Arrowood/node that referenced this issue Feb 5, 2024
Adds a new option `process.report.networkDisabled` and
cli option `--report-network-disabled` which will disable
any netowkring operations for the `report` generation.

Fixes: nodejs#46060
Ethan-Arrowood added a commit to Ethan-Arrowood/node that referenced this issue Feb 5, 2024
Adds a new option `process.report.networkDisabled` and
cli option `--report-network-disabled` which will disable
any netowkring operations for the `report` generation.

Fixes: nodejs#46060

PR-URL: nodejs#51645
Ethan-Arrowood added a commit to Ethan-Arrowood/node that referenced this issue Feb 5, 2024
Adds a new option `process.report.networkDisabled` and
cli option `--report-network-disabled` which will disable
any netowkring operations for the `report` generation.

Fixes: nodejs#46060

PR-URL: nodejs#51645
Ethan-Arrowood added a commit to Ethan-Arrowood/node that referenced this issue Feb 9, 2024
New option `--report-exclude-network`, also available as `report.excludeNetwork`,
enables the user to exclude networking interfaces in their diagnostic report.
On some systems, this can cause the report to take minutes to generate so this
option can be used to optimize that.

Fixes: nodejs#46060

PR-URL: nodejs#51645
Ethan-Arrowood added a commit to Ethan-Arrowood/node that referenced this issue Feb 9, 2024
New option `--report-exclude-network`, also available as `report.excludeNetwork`,
enables the user to exclude networking interfaces in their diagnostic report.
On some systems, this can cause the report to take minutes to generate so this
option can be used to optimize that.

Fixes: nodejs#46060

PR-URL: nodejs#51645
Ethan-Arrowood added a commit to Ethan-Arrowood/node that referenced this issue Feb 9, 2024
New option `--report-exclude-network`, also available as
`report.excludeNetwork`, enables the user to exclude
networking interfaces in their diagnostic report.
On some systems, this can cause the report to take minutes
to generate so this option can be used to optimize that.

Fixes: nodejs#46060

PR-URL: nodejs#51645
Ethan-Arrowood added a commit to Ethan-Arrowood/node that referenced this issue Feb 9, 2024
New option `--report-exclude-network`, also available as
`report.excludeNetwork`, enables the user to exclude
networking interfaces in their diagnostic report.
On some systems, this can cause the report to take minutes
to generate so this option can be used to optimize that.

Fixes: nodejs#46060

PR-URL: nodejs#51645
Ethan-Arrowood added a commit to Ethan-Arrowood/node that referenced this issue Feb 9, 2024
New option `--report-exclude-network`, also available as
`report.excludeNetwork`, enables the user to exclude
networking interfaces in their diagnostic report.
On some systems, this can cause the report to take minutes
to generate so this option can be used to optimize that.

Fixes: nodejs#46060

PR-URL: nodejs#51645
Ethan-Arrowood added a commit to Ethan-Arrowood/node that referenced this issue Feb 12, 2024
New option `--report-exclude-network`, also available as
`report.excludeNetwork`, enables the user to exclude
networking interfaces in their diagnostic report.
On some systems, this can cause the report to take minutes
to generate so this option can be used to optimize that.

Fixes: nodejs#46060

PR-URL: nodejs#51645
Ethan-Arrowood added a commit to Ethan-Arrowood/node that referenced this issue Feb 12, 2024
New option `--report-exclude-network`, also available as
`report.excludeNetwork`, enables the user to exclude
networking interfaces in their diagnostic report.
On some systems, this can cause the report to take minutes
to generate so this option can be used to optimize that.

Fixes: nodejs#46060

PR-URL: nodejs#51645
Ethan-Arrowood added a commit to Ethan-Arrowood/node that referenced this issue Feb 14, 2024
New option `--report-exclude-network`, also available as
`report.excludeNetwork`, enables the user to exclude
networking interfaces in their diagnostic report.
On some systems, this can cause the report to take minutes
to generate so this option can be used to optimize that.

Fixes: nodejs#46060

PR-URL: nodejs#51645

Co-authored-by: Joyee Cheung <[email protected]>
Ethan-Arrowood added a commit to Ethan-Arrowood/node that referenced this issue Feb 14, 2024
New option `--report-exclude-network`, also available as
`report.excludeNetwork`, enables the user to exclude
networking interfaces in their diagnostic report.
On some systems, this can cause the report to take minutes
to generate so this option can be used to optimize that.

Fixes: nodejs#46060

PR-URL: nodejs#51645

Co-authored-by: Joyee Cheung <[email protected]>
Ethan-Arrowood added a commit to Ethan-Arrowood/node that referenced this issue Feb 15, 2024
New option `--report-exclude-network`, also available as
`report.excludeNetwork`, enables the user to exclude
networking interfaces in their diagnostic report.
On some systems, this can cause the report to take minutes
to generate so this option can be used to optimize that.

Fixes: nodejs#46060

PR-URL: nodejs#51645

Co-authored-by: Joyee Cheung <[email protected]>
Ethan-Arrowood added a commit to Ethan-Arrowood/node that referenced this issue Feb 21, 2024
New option `--report-exclude-network`, also available as
`report.excludeNetwork`, enables the user to exclude
networking interfaces in their diagnostic report.
On some systems, this can cause the report to take minutes
to generate so this option can be used to optimize that.

Fixes: nodejs#46060

PR-URL: nodejs#51645

Co-authored-by: Joyee Cheung <[email protected]>
nodejs-github-bot pushed a commit that referenced this issue Mar 1, 2024
New option `--report-exclude-network`, also available as
`report.excludeNetwork`, enables the user to exclude
networking interfaces in their diagnostic report.
On some systems, this can cause the report to take minutes
to generate so this option can be used to optimize that.

Fixes: #46060

PR-URL: #51645

Co-authored-by: Joyee Cheung <[email protected]>
Reviewed-By: Yagiz Nizipli <[email protected]>
Reviewed-By: Joyee Cheung <[email protected]>
rdw-msft pushed a commit to rdw-msft/node that referenced this issue Mar 26, 2024
New option `--report-exclude-network`, also available as
`report.excludeNetwork`, enables the user to exclude
networking interfaces in their diagnostic report.
On some systems, this can cause the report to take minutes
to generate so this option can be used to optimize that.

Fixes: nodejs#46060

PR-URL: nodejs#51645

Co-authored-by: Joyee Cheung <[email protected]>
Reviewed-By: Yagiz Nizipli <[email protected]>
Reviewed-By: Joyee Cheung <[email protected]>
marco-ippolito pushed a commit that referenced this issue May 2, 2024
New option `--report-exclude-network`, also available as
`report.excludeNetwork`, enables the user to exclude
networking interfaces in their diagnostic report.
On some systems, this can cause the report to take minutes
to generate so this option can be used to optimize that.

Fixes: #46060

PR-URL: #51645

Co-authored-by: Joyee Cheung <[email protected]>
Reviewed-By: Yagiz Nizipli <[email protected]>
Reviewed-By: Joyee Cheung <[email protected]>
@Tofandel
Copy link
Contributor

Tofandel commented Oct 28, 2024

It looks like yarn is using it to determine libc variant

Now npm is also using it for this exact same purpose since 10.4.0 and this issue still seem to happen even with exclude-network
https://github.com/npm/npm-install-checks/blob/main/lib/current-env.js#L22-L25

npm/cli#4028
npm/cli#7814

@Tofandel
Copy link
Contributor

Tofandel commented Oct 28, 2024

I just tried with the test case in #46060 (comment)

Running node --report-exclude-network test.js

And I can confirm this issue is not solved, the excludeNetwork option did not target the root of the issue which was libuv doing reverse DNS queries

With even one socket open
2024-10-28T22:51:00.664Z all sockets created. requesting report...
report: 20.850s
2024-10-28T22:51:21.514Z destroying client
2024-10-28T22:51:21.514Z end

And the script also never exits because of https.get

There is the same problem with fetch (minus the program not exiting)

console.log(process._getActiveHandles()); shows that the socket is not destroyed by the end of the program

If I close the socket manually then everything works as expected

(res) => {
  process.stdout.write(".");
  resolve();
  res.socket.destroy();
}

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
report Issues and PRs related to process.report.
Projects
None yet
Development

Successfully merging a pull request may close this issue.

10 participants