Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

CI management for Tier 2 platforms #1666

Closed
joyeecheung opened this issue Dec 18, 2024 · 35 comments
Closed

CI management for Tier 2 platforms #1666

joyeecheung opened this issue Dec 18, 2024 · 35 comments

Comments

@joyeecheung
Copy link
Member

joyeecheung commented Dec 18, 2024

Spinning from the #1663, during the TSC discussions today we noticed that the tiers are documented as follows in BUILDING.md

Tier 1: These platforms represent the majority of Node.js users. The Node.js Build Working Group maintains infrastructure for full test coverage. Test failures on tier 1 platforms will block releases.
Tier 2: These platforms represent smaller segments of the Node.js user base. The Node.js Build Working Group maintains infrastructure for full test coverage. Test failures on tier 2 platforms will block releases. Infrastructure issues may delay the release of binaries for these platforms.
Experimental: May not compile or test suite may not pass. The core team does not create releases for these platforms. Test failures on experimental platforms do not block releases. Contributions to improve support for these platforms are welcome.

Which does not say anything about Tier 2 platforms must be in the pull request testing CI, and there isn't much distinction between Tier 1 and Tier 2 if they are both in the testing CI.

Personally I think given the current situation of the testing CI, we are biting more than we can chew if we still insist in keeping the Tier 2 platforms in every pull request testing CI run. It requires more resources than we currently have, increase CI run time, and creates a bigger flake surface. (for example, in the reliability report yesterday you can see several tier 2 platforms on the list). It also shifts the burden of making every changes work on that platform to individual contributors, who can be new to the project and it can be frustrating for them to get blocked by these niche platforms that they don't care too much about, and overall making it harder to gain and retain new contributors.

Is it really more important to keep every pull request to pass all the tests in these tier 2 platforms than providing a frictionless experience for contributors? Personally I don't think that's the right prioritization. Keeping the tests green on a platform in the CI, specifically, is not the same as just "keeping Node.js work on that platform" - for example a test could pass perfectly fine on a well-maintained version of the platform, but still is in trouble for some reason in the CI on that platform, which may be strictly related to the state of the machine in the CI, or the specific setup of that machine. It does not seem right say for a new contributor to open a pull request to add util.something which is a straight-forward JS thing with a straight-forward test, and they end up scratching their head at some mysterious flake/machine problem on a niche tier 2 platform in the CI, and have to keep pinging other collaborators to restart the CI to get it landed (assuming they don't get frustrated enough themselves and give up on the PR). This can happen to tier 1 platform too, but at least the user base of tier 1 platforms is enough to deserve the attention of innocent new contributors, but tier 2 really isn't.

I think a more tenable solution would be to keep tier 2 platforms in the node-daily-master CI, and let Jenkins notify corresponding stakeholders when tests fail there. The stakeholders could come back opening an issue or opening a pull request to fix the test for that platform, and make sure that things are in order for the releases (if there are infrastructure issues that makes it hard to build the binaries in time for the release then they'll be delayed - which is precisely what the Tier 2 description says). For node-test-pull-request, we can either remove the tier 2 platforms, or only make sure that they can build but do not actually run the tests, as the marginal cost of keeping all the tests green on all these platforms isn't really worth the benefit at this point IMO.

cc @nodejs/tsc @nodejs/build

@targos
Copy link
Member

targos commented Dec 19, 2024

I think we should rework all the Tiers to have, in a nutshell:

  • Tier 1 (Maintained by the project / collaborators):
    • Tested on every pull request
    • We produce release binaries
    • Test failures prevent pull requests from landing
  • Tier 2 (Maintained by the community / interested parties):
    • Tested on every pull request
    • We do not produce release binaries
    • Test failures do not prevent pull requests from landing
  • Tier 3 (Experimental):
    • No tested on every pull request
  • We only accept pull requests that target tiered platforms
  • For Tier 2 platforms, we document who (GitHub team, individuals) responsible for their support and setup automation so they are notified when the build is broken or tests are failing.

@legendecas
Copy link
Member

Tier 2:

  • Tested on every pull request
  • Test failures do not prevent pull requests from landing

These two conditions conbimed do not sound like a healthy CI situation. If the test failures do not prevent pull requests from landing, it should not be tested on every pull request. It adds consufions.

@aduh95
Copy link
Contributor

aduh95 commented Dec 19, 2024

I agree, CI failure of a Tier 2 platform should block landing stuff until the team responsible for that platform has been pinged and come back with a path forward (e.g. wait for a fix to land, or skip the failing test), with a maximum waiting time after which that platform would automatically be downgraded to Tier 3, and the PR can land.

@gireeshpunathil
Copy link
Member

the stated (and desired, IMO) difference between t1 and t2 is the size of user base. not releasing binaries for t2 will affect all its users, whatever volume of users it might be in comparison with t1 users. I am worries about that approach, for the sake of providing frictionless experience to contributors.

lets compare this with security scenario. if there are less folks to fix vulnerabilities, will we let the code bleed out, or stop supporting part of the code? no, we raise funds to address it. I think for t2 platforms that have users but less contributors, we could adopt a similar mechanism, until we see the platform is not in use.

@targos
Copy link
Member

targos commented Dec 19, 2024

What I imagine is that CI failures would not be as visible as Tier 1 (for example not as a Red Cross in GitHub UI)

@marco-ippolito
Copy link
Member

I agree, CI failure of a Tier 2 platform should block landing stuff until the team responsible for that platform has been pinged and come back with a path forward (e.g. wait for a fix to land, or skip the failing test), with a maximum waiting time after which that platform would automatically be downgraded to Tier 3, and the PR can land.

I totally agree here, we should document also how/why/when a platform drops or climbs the tierlist

@targos
Copy link
Member

targos commented Dec 19, 2024

@gireeshpunathil I'm not suggesting we stop releasing binaries that we already release today. If we have Tier 2 platforms for which we release binaries, they should be promoted to Tier 1

@targos
Copy link
Member

targos commented Dec 19, 2024

@aduh95 what you suggest sgtm too

@joyeecheung
Copy link
Member Author

joyeecheung commented Dec 19, 2024

I agree, CI failure of a Tier 2 platform should block landing stuff until the team responsible for that platform has been pinged and come back with a path forward (e.g. wait for a fix to land, or skip the failing test), with a maximum waiting time after which that platform would automatically be downgraded to Tier 3, and the PR can land.

I think in the short term that sounds good but we also need to watch out for the case where "nobody has the time to ping anyone and things just get stalled for no good reason", since when a PR is opened, no one is obliged to review it or look into the CI failure, this means it still falls on the individual contributors to relentlessly ping other collaborators, who might know how this work, provided that

  1. The new contributor knows that this is what they need to do, not everyone gets that your PR don't get automatic sheperding and you need to take the initiative to bug other people
  2. The collaborator they find do know how this works - we have many collaborators that are just casual contributors and are not up-to-date with whatever policy there is - and have the time/keep it on their radar enough to help them ping the platform teams.

And as this is a volunteer-based project, I often see PRs getting abandoned or stalled just because "the CI doesn't pass on the first run" (sometimes by collaborators, too), which is something we should make some effort to avoid or it makes it difficult for the project to gain/retain new contributors.

We can see how this pans out and if it still ends up stalling pull request that would otherwise not have this friction if the tier 2 platforms aren't blocking any PR, we should consider removing them from the PR CI altogether.

@bnoordhuis
Copy link
Member

I think a more tenable solution would be to keep tier 2 platforms in the node-daily-master CI, and let Jenkins notify corresponding stakeholders when tests fail there. The stakeholders could come back opening an issue or opening a pull request to fix the test for that platform [..]

I wholeheartedly agree with this. Pull requests have waaaaay too much friction now, so much so that many people aren't going to bother, myself included. Take away the roadblocks and let the tier 2 platforms look after themselves.

@gireeshpunathil
Copy link
Member

while it is reasonable in an OSS project where contributor's time is voluntary, prioritising decisions based on contributor's perspective alone may risk alienating tier2 users - "come fix it yourself if you care". the project is sufficiently matured and being used heavily in the field (enterprise and non-enterprise) and TSC should try everything it can before amending release decisions and platform support matrix:

  • encouraging collaboration for those platforms
  • emphasising user impact and stakeholder needs
  • acknowledge t2 users may not have the expertise to maintain node.js
  • add elevation of t2 platform as project's long-term vision
  • outline ways to make contributions easier
    • mentoring programs
    • documentation
    • funding initiatives

this positions our relationship between contributors and users as a mutual one and reflects the spirit of community.

@joyeecheung
Copy link
Member Author

joyeecheung commented Dec 20, 2024

prioritising decisions based on contributor's perspective alone may risk alienating tier2 users

For tier 1 users, I think that's a risk we should consider, for tier 2 I don't think so as by definition they don't have as many users, and usually they are backed by some company who answers to these tier 2 users as their own customers. By caring too much about tier 2 users, we also risk giving the community the wrong impression that they are somehow a customer to the Node.js as well and are entitled to customer service from the project (this is already not true for tier 1 users, but at least tier 1 users are less likely to pay any company in a way that can create a delusion that the money is proxied to a random Node.js project contributor that they are talking to).

"come fix it yourself if you care".

I think this is already the reality, and is how a volunteer-based project should operate in general. For actual, generic bugs that affect all platforms, it has always been "come fix it yourself if you care". There are still many bugs lying around among the 1600+ issues with dozens of thumbs up over the years while no one makes enough effort to fix them. IMO it's too much of a luxury to prioritize anything about tier 2 platforms while things are already bad enough for generic bugs on all platforms and we have been in survival mode for a long time.

TSC should try everything it can before amending release decisions and platform support matrix:

The TSC is also based on volunteers. Unless there are volunteers in the TSC actively pledged to spend the time and make this list happen, the more likely outcome is that nothing is going to be done along these lines. Shifting the "biting off more than we can chew" problem from individual contributors to a subset of them (the TSC) isn't really going to solve the problem, either, the most practical solution remains to stop taking on more responsibility than what the volunteers are able to afford.

@anonrig
Copy link
Member

anonrig commented Dec 20, 2024

Take away the roadblocks and let the tier 2 platforms look after themselves.

I agree with @bnoordhuis here.

@richardlau
Copy link
Member

Personally I think given the current situation of the testing CI, we are biting more than we can chew if we still insist in keeping the Tier 2 platforms in every pull request testing CI run. It requires more resources than we currently have, increase CI run time, and creates a bigger flake surface. (for example, in the reliability report yesterday you can see several tier 2 platforms on the list). It also shifts the burden of making every changes work on that platform to individual contributors, who can be new to the project and it can be frustrating for them to get blocked by these niche platforms that they don't care too much about, and overall making it harder to gain and retain new contributors.

Looking at that list there is only one entry that only appeared on a tier 2 platform. Everything else also occurred on a tier 1.

@joyeecheung
Copy link
Member Author

joyeecheung commented Dec 20, 2024

Looking at that list there is only one entry that only appeared on a tier 2 platform. Everything else also occurred on a tier 1.

The flake does not need to solely reproducible on one platform - the problem is that running tests on too many platforms increases the flake surface. Suppose a flake have a 1% rate to fail on multiple platforms, the more platforms we run on every node-test-pull-request, the more likely it is to show up in at least one of them and create friction in a PR. One can argue that it is important to make sure that we test Node.js under stress to make sure the test always passes, even if we risk false positives from unstable machine states, that can be true for tier 1, but for tier 2 with less users, do we really need the test suite to be that perfect? I doubt so. If we can't even make it perfect for tier 1, trying to also make it perfect for tier 2 is biting off way more than what we can chew.

@richardlau
Copy link
Member

The majority of the current Tier 2 platforms are Red Hat/IBM ones and it is important for us (Red Hat/IBM) to have the CI run on every PR and to have release binaries on them. IBM stopped producing Node.js rebuilds in 2018 due to the investments (both machine resources and developers (including to upstream V8)) made in the community so for AIX there is nowhere else to get Node.js (RHEL customers can get the Red Hat Build of Node.js).

At a time where there is a lot of talk over sustainability and whether companies contribute towards open source projects I would point out that there is more than one way to contribute to projects and for Node.js, both IBM and MNX sponsor several machines which include Linux x64 (i.e. not their primary concern platforms) servers. This includes all three jenkins-workspace machines which are large (in terms of RAM/storage) machines which all pull request CI jobs go through for rebasing onto the base branch prior to fanning out onto the specific architectures.

Specifically on the IBM/RH side of things we have provisioned (through sponsorship) more than the guideline minimum number (2) of servers to the test CI for our platforms along with the effort required to keep them reliably running.

The reality is that of the releases where we have had to discuss releasing without a platform the most recent cases were Tier 1 (32-bit Windows in Node.js 18 and macOS in Node.js 23).

@ShogunPanda
Copy link
Contributor

My two cents here. Since it is already a fair long discussion I'll keep them brief.

  1. Remove Tier-2 and below from EACH PR and keep it in daily CI, as @joyeecheung proposed.
  2. As @aduh95: let's implement a procedure in which we give Tier-2 people some predefined time to unblock a release. Obviously, we need to also implement the procedure to undo the downgrade once all blocking issue have been solved.

@richardlau Does it pose a big difference for your platform to be in each PR compared to daily CI? If yes, can you please explain why?

@jclulow
Copy link

jclulow commented Dec 20, 2024

G'day!

Does it pose a big difference for your platform to be in each PR compared to daily CI? If yes, can you please explain why?

I'm not a Richard, but rather a member of the core team for illumos (known also, here, as SmartOS) and as another Tier 2 platform I think we're in a similar boat.

From our community's experience assisting in maintenance on other popular toolchains like Go and Rust, where we are also at Tier 2 status, it's super important to keep the context alive when dealing with platform issues. That is, it's important that we can at least aid in making a decision on whether to ignore a failure prior to integration of a change, so that we can participate in the discussion of the change while it is still fresh in the author's mind. Most of the time we'll presumably be able to offer immediate guidance for the vast majority of relatively small differences between UNIX platforms.

I don't believe anybody is demanding free work from contributors, to be clear: rather, the author of any change is best placed to help understand and amend that change -- and, understandably, they will have paged it out and moved on to other things once it lands! If anything, it seems more disrespectful of their time and flow to try and ask them questions later as an attempt by Tier 2 maintainers to catch up with regressions introduced into the software.

I am strongly in favour of continuing to test our platform in pre-integration CI, and as I mentioned in another issue just now, I am here to help improve communication between our two projects so that we can make that as painless as possible for f olks!

@mcollina
Copy link
Member

I think if we start landing PR with tests that are not passing on Tier 2 platforms, than we will quickly lose them as viable Node.js platforms. I don’t think this is good for Node.js, nor it is moving the responsibility alone on the companies volunteering to support those platforms.

Frankly, it’s very rare that I have a failure on any tier 2 platforms: most of them my frustration is directed at Windows (you might have a different mileage).

How much work is actually blocked by code that doesn’t work in Tier 2 platforms? As many have stated, the biggest issue in maintaining all our platforms are related to keeping the HW & OS up to date.

Can someone proposing this change do an inventory of PRs that are blocked by the inability of getting a change implemented in any Tier 2 plaftorms that are not related to HW & OS versions?

@joyeecheung
Copy link
Member Author

I think if we start landing PR with tests that are not passing on Tier 2 platforms, than we will quickly lose them as viable Node.js platforms.

Isn't this contradicting what you say:

Frankly, it’s very rare that I have a failure on any tier 2 platforms: most of them my frustration is directed at Windows

Either most PRs are not platform specific that it's fine to not test most PRs in the testing CI and only catch any real failure in the daily master CI, or most PRs would really fail on tier 2 platforms without special care that they will quickly become non viable?

My impression is that most failures on tier 2 platforms are flakes and catching them is meaningless to anyone, and only introduce friction to contribution. You don't need them to be real blockers - the fact that a PR works locally abut doesn't work in the CI and one has to look at Jenkins to find failures and ping someone else to restart CI is enough to make contributors frustrated and give up.

Can someone proposing this change do an inventory of PRs that are blocked by the inability of getting a change implemented in any Tier 2 plaftorms that are not related to HW & OS versions?

I don't think it needs to get to become final blockers to introduce friction - the fact that there are many flakes present on tier 2 platforms (you can find many of those in https://github.com/nodejs/reliability/tree/main/reports already show that they are increasing the friction. We have already not been able to keep the CI green on even just tier 1 platforms for many many years, trying to also extend it to tier 2 platforms is just biting off more than we can chew, and removing them from the testing CI would already make the CI greener simply because there will be fewer attempts to "reproduce" a flake. People don't get blocked by these flakes - they can simply stop trying when it doesn't work on the first try, since they aren't obliged to get the changes in, this happens not only to casual contributors but also collaborators. You'd only know what even is the blocker when the author tries hard enough to eliminate all the other friction, and that is already assuming too much from most contributors.

@anonrig
Copy link
Member

anonrig commented Dec 20, 2024

Can someone proposing this change do an inventory of PRs that are blocked by the inability of getting a change implemented in any Tier 2 plaftorms that are not related to HW & OS versions?

@mcollina Does not reflect the truth at all, but Node.js build has 3 access requests in the last 2 year. https://github.com/nodejs/build/issues?q=is%3Aissue%20label%3A%22access%20request%22%20label%3Aplatform%3Asmartos%20%20

You can also find all issues related to smartOS created in the Node.js build repository: https://github.com/nodejs/build/issues?q=is%3Aissue%20label%3Aplatform%3Asmartos

You can also find smartOS specific incidents in the Node.js CI:

For test-equinix_mnx-smartos: https://github.com/nodejs/jenkins-alerts/issues?q=is%3Aissue%20test-equinix_mnx-smartos20
For test-mnx-smartos23: https://github.com/nodejs/jenkins-alerts/issues?q=is%3Aissue%20%20test-mnx-smartos23

@joyeecheung
Copy link
Member Author

joyeecheung commented Dec 20, 2024

That is, it's important that we can at least aid in making a decision on whether to ignore a failure prior to integration of a change, so that we can participate in the discussion of the change while it is still fresh in the author's mind. Most of the time we'll presumably be able to offer immediate guidance for the vast majority of relatively small differences between UNIX platforms.

While it is great that there is help, it still requires someone to ask for help first, and Node.js isn't a project where this works particularly well. Many people would simply give up on a PR if it fails in the CI, as they are not obliged to get the chances in anyway, or if the CI doesn't pass on the first run, a new contributor needs to ping a collaborator to keep retrying until they know that it's a flake, and this back-and-forth alone is enough to stall many PRs. I think the only way this would work is for someone from the platform team to actively monitor every single pull request testing CI run, and proactively comment in the PR when it fails on that platform telling the author whether this is a flake that they can ignore or it's a true error and how it should be fixed, instead of waiting for someone to take the initiative to triage, find out who to ping, and do the ping (note that non-members cannot ping teams, so new contributors again have to ask for help from someone in the organization, which introduces even more friction). I don't see that kind of extra workload is a good use of the platform teams' time, either, but that's what it takes to reduce the friction. The alternative of only doing that for the daily CI seems much more reasonable to me.

If anything, it seems more disrespectful of their time and flow to try and ask them questions later as an attempt by Tier 2 maintainers to catch up with regressions introduced into the software.

Note that we are discussing about moving it to the daily master CI, so if someone from the platform team is paying attention to it, it should take no more than 24 hours for them to notice a merged PR that breaks a tier 2 platform in the daily master CI. That seems to be a reasonable timeframe to me - in Node.js you generally can't expect someone to reply to you for something not urgent in 24 hours, anyway.

What happens a lot more common these days, however, is not that people can miss a true platform-specific failure on tier 2 platforms if they don't run tests on it, but that people get too many false positives by trying to run the flaky test suite on too many platforms, including tier 2 ones, which increase their chances of reproducing a flake. And triaging a flake is already costly enough for many contributors that it can stall PRs as people get frustrated.

@ljharb
Copy link
Member

ljharb commented Dec 20, 2024

Can the tier 2 tests be run on every PR, but marked as "optional", so their failures are known but they don't block progress?

@jclulow
Copy link

jclulow commented Dec 20, 2024

a new contributor needs to ping a collaborator to keep retrying until they know that it's a flake

If there are enough flakey tests that this is a big issue, why not disable (or mark ignored) the flakey tests rather than entire platforms? If they're truly just flakey, perhaps retry their execution a few times prior to giving up?

It seems like that would go a lot further towards solving the problem you're describing, where casual contributors are discouraged by CI failures -- not just on Tier 2 platforms, but indeed on all platforms! It would also then be easier to get automated notifications of platform-specific CI failures sent to people that can help, because there would be a lot more confidence in the failure being a real failure rather than a flake.

@joyeecheung
Copy link
Member Author

joyeecheung commented Dec 20, 2024

why not disable (or mark ignored) the flakey tests rather than entire platforms? If they're truly just flakey, perhaps retry their execution a few times prior to giving up?

This is already what we do today, but

  1. Triaging whether a test is a true flake but not a bug is also work, it is not even rare for flakes to flake the CI for months until someone actually opens a CI to mark them as flaky. You can take a look at the trend of https://github.com/nodejs/reliability/tree/main/reports to see how little volunteering is done to triage the flakes and mark them as flaky (or if you want to see something really ridiculous - check out this flake that had been flaking the CI for more than a year without anyone doing anything to it, until I couldn't stand the CI situation, looked into the reliability reports, noticed it, and opened a PR to mark it as flaky test: mark test-http-regr-gh-2928 as flaky node#49565)
  2. Retrying requires CI access, and new contributors would have to ask someone who has the access over and over again
  3. Even for collaborators, it's still not rare that they can't get the CI green after many retries and get frustrated and give up.

Most people, myself included, would just do the bare minimum to keep what they are working on shephered through, and can't really invest all that much in the general situation of the CI. Constantly gardening the CI is a very demanding job, we used to have a bit more volunteering doing it, but these days we generally don't, and we are even drafting a SoW to hire someone to do all these work: #1629 - with no guarantee that when or whether it will actually happen.

If we are already this desperate in the amount of volunteering available to the CI even if we only look at tier 1 platforms, trying to run the tests on even more platforms just to reproduce them more really looks like biting off more than we can chew to me.

@mcollina
Copy link
Member

@joyeecheung I agree with you that the flaky tests are our main problem. Cutting down the supported platforms might improve that situation, but it's a bandaid with too many side effects and risks, including alienating a few of our corporate sponsors.
We have extensively discussed the CI situation in #1614 and are trying to improve the flaky test situation. Let's not mix the two things.

Note that we are discussing about moving it to the daily master CI, so if someone from the platform team is paying attention to it, it should take no more than 24 hours for them to notice a merged PR that breaks a tier 2 platform in the daily master CI. That seems to be a reasonable timeframe to me - in Node.js you generally can't expect someone to reply to you for something not urgent in 24 hours, anyway.

I don't think this will happen. Asking for people to be on call to verify a failed build every day that might be flaky is something I would not put on the companies supporting those platforms. Note that all the maintainers of those platforms are asking for their platforms to keep being tested for every CI. I'm not comfortable disregarding their point of view on this topic.

Putting it into other terms... how can we improve the situation for Tier 2 platforms while keeping them on the testing for every pull request?

@mcollina
Copy link
Member

mcollina commented Dec 20, 2024

@anonrig I specifically asked about work blocked by genuine OS differences between Tier 1 and Tier 2 platforms. You have ported examples of HW & OS problems. Are there any open PRs that have been blocked by a divergence of behaviors between the platforms that the relative teams did not address promptly?

@mcollina
Copy link
Member

@aduh95 point really reasonates with me and I think it provides a good solution:

I agree, CI failure of a Tier 2 platform should block landing stuff until the team responsible for that platform has been pinged and come back with a path forward (e.g. wait for a fix to land, or skip the failing test), with a maximum waiting time after which that platform would automatically be downgraded to Tier 3, and the PR can land.

I think we could put that to 2 weeks.

@joyeecheung
Copy link
Member Author

joyeecheung commented Dec 20, 2024

Asking for people to be on call to verify a failed build every day that might be flaky is something I would not put on the companies supporting those platforms.

It needs to be done by someone, at the end of the day. If not the platform teams, who would be the volunteers to verify these failed builds on tier 2 platforms?

how can we improve the situation for Tier 2 platforms while keeping them on the testing for every pull request?

We are already desperate enough even if we only care about tier 1 platforms, thinking about how we can improve the situation for tier 2 platforms when CI on tier 1 platforms has already been in chaos for years is too much of a luxury IMO (even if we only start from the point where I couldn't stand the CI and developed https://github.com/nodejs/reliability, that repository is already 6 years old, and unfortunately I don't think the CI is any better than what it was 6 years ago...)

@joyeecheung
Copy link
Member Author

Anyway, I only opened this issue because @mcollina asked me to during the TSC meeting. I don't think we have consensus on actually doing anything to node-test-pull-request, in that case I don't think anything of substance is going to change, and I think 48 hours is the maximum timespan that I should invest in the discussions on this topic, any more than this then I am dying on a hill that I am not really motivated to die on, and I feel that I have started to offend people that I don't really want to offend. Sorry for wasting people's time, but I am just going to close it to kick it out of my head during the holidays :)

@richardlau
Copy link
Member

richardlau commented Dec 20, 2024

@richardlau Does it pose a big difference for your platform to be in each PR compared to daily CI? If yes, can you please explain why?

Let's take for example experimental-strip-types. This was identified as failing on big endian platforms on 15 Jul and was resolved in swc. It didn't land on main until 24 Jul which is over a week later. So we're talking ten days from the first opportunity to spot the failure on a Tier 2 platform until the next daily build (25 Jul). Whoever on the IBM/RH side spots this (most likely me) would have to unravel how this feature is implemented and spot that the fix would require a new version of a dependency (i.e. the context (as aluded to in #1666 (comment)) is lost).

I am very aware that we have a (IMO bad) habit of landing features at the last minute for releases (often with semver majors, where I know we have changed things there to hopefully improve matters). Which will put added pressure on Tier 2 platforms to get things resolved before release. And I reiterate that for us it's important to have release binaries on our platforms -- I would push to get our platforms upgraded to Tier 1 if we decide not to release binaries for Tier 2 (as has been suggested).

(FYI today was my last working day of the year. I'll be back on 6 Jan. At this moment in time I'm disinclined to check GH over the holiday period.)

@anonrig
Copy link
Member

anonrig commented Dec 20, 2024

I think we should have a vote on this. If there is no consensus, the goal of TSC is to resolve it eventually by voting. @joyeecheung

@mcollina
Copy link
Member

@joyeecheung thanks for opening this. I think debating this was important.

@mhdawson
Copy link
Member

Just back from vacation and I see the discussion is closed so I won't say too much. I think @richardlau has coverved my thoughts pretty well in terms of the importance for Tier 2 platforms to be in the PR testing, binaries beeing shipped and that we've worked hard to ensure the IBM/RH platforms in Tier 2 don't block forward progress.

@joyeecheung
Copy link
Member Author

joyeecheung commented Jan 6, 2025

I think we should have a vote on this. If there is no consensus, the goal of TSC is to resolve it eventually by voting.

Feel free to open a vote or a new issue if you want to continue discussing. I think people have been overestimating the marginal benefit it brings to test platforms depended by <1% of the user base in every single node-test-pull-request CI v.s. just in node-daily-master CI (or additional node-test-commit CI in deps upgrades/PRs considered as risky by the platform teams, if need be), or underestimating the cost of relying on volunteers to triage additional flakes on these platforms under the current miserable situation of the test suites, allowing the platform teams to be too reactive instead of being proactive about the share of CI instability these platforms contribute to, but I don't have enough energy for this debate.

If we have to spend more than a week debating the merits of changing CI configuration, IMO that time would've been better spent on actually fixing flakes or infra, or even just triaging flakes which people generally don't do for others, though personally I don't want to volunteer that much time in the CI myself, as I fear it's very easy to be taken for granted if one invests too much free time on it, and it's something that really needs collective efforts from the mass/people getting paid instead of just taxing a few individuals to improve.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests