CUS-349: CI: enable remote execution in jobs that can use it #45

jayconrod · 2024-08-30T18:26:35Z

Added platform and C++ toolchain configurations for platforms we use
that have configured remote execution pools. This is mostly copied
from github.com/EngFlow/examples.

In .bazelrc, added platform configurations like remote_linux_x64.
One of these needs to be specified when using remote execution.
Also, rename the opal configuration to engflow and split out engflow_bes.
We'll only support one cluster for this repo.

In workflows, added --config=<remote_platform> and replaced the
CLUSTER_CONFIG environment variable with either engflow or engflow_bes
as appropriate. The jobs driven with 'bazel run' cannot use remote
execution because that always runs on the host, so they're engflow_bes.

Added platform and C++ toolchain configurations for platforms we use that have configured remote execution pools. This is mostly copied from github.com/EngFlow/examples. In .bazelrc, added platform configurations like remote_linux_x64. One of these needs to be specified when using remote execution. Also split the opal and opal_bes_only configurations because not all jobs can use remote execution. In workflows, added --config=<remote_platform> and replaced the CLUSTER_CONFIG environment variable with either opal or opal_bes_only as appropriate. The jobs driven with 'bazel run' cannot use remote execution because that always runs on the host, so they're opal_bes_only.

jayconrod · 2024-08-30T19:20:26Z

Test runs:

main: https://github.com/EngFlow/auth/actions/runs/10638044492
release: https://github.com/EngFlow/auth/actions/runs/10638055339 (expected failure, can't be run on dev branch. I'll verify it works in a follow-up.)

ola-rozenfeld

I don't actually understand the bazel changes.

ola-rozenfeld · 2024-09-01T15:02:00Z

.bazelrc

-# TODO(CUS-349): enable remote execution
+build:engflow --config=engflow_common
+build:engflow --config=engflow_bes
+build:engflow --remote_executor=grpcs://opal.cluster.engflow.com


Note: glass will be faster, because the CI runner is on glass, so it will be faster if the RE action is on the same cluster. Fine to switch in a followup (or create separate configs for glass and opal so that the switch for both CI Runners and RE can be done more easily).

Can't use Glass yet, mainly because it's not authorized for open source, though I think we can change that. It also doesn't have a Mac RE pool, though we don't strictly need that.

ola-rozenfeld · 2024-09-01T15:11:10Z

release: https://github.com/EngFlow/auth/actions/runs/10638055339 (expected failure, can't be run on dev branch. I'll verify it works in a follow-up.)

About that: I was also annoyed that I had to find and comment out the code that does this check when I migrated the job to CI Runners (and it was not trivial). I think this should be easier (for example, an input parameter that needs to be manually checked in order to enable releasing from a dev branch). Everything should be possible from a dev branch, imo, otherwise debugging it is a pain; maybe the outputs of a release from a dev branch should be easily distinguishable from those of master, so that they couldn't be used?

ola-rozenfeld · 2024-09-01T15:13:48Z

Oh, just saw #46, you're taking care of that -- great!

minor-fixes · 2024-09-03T15:42:19Z

.github/workflows/main.yml

@@ -44,7 +42,7 @@ jobs:
      - "engflow-pool=ci_sysbox_x64"
      - "engflow-runtime=sysbox-runc"
      - "engflow-runner-id=${{ github.repository_id }}_bazel-builder_${{ github.run_id }}_${{ github.run_number }}_${{ github.run_attempt }}"
-    timeout-minutes: 10
+    timeout-minutes: 30


How is this timeout sized, now that it's extended to 30m?

10m wasn't fast enough to reliably complete a clean build on Opal. It's usually enough, but not always.

I hit a problem that was kind of a worst case: completely cold cache on a pool that had queueing but hadn't scaled up yet. The GoStdLib actions are particularly slow with a cold cache because they're not optimized for RE, and there are a lot of them because we're cross-compiling to every platform on every build.

jayconrod · 2024-09-04T16:01:05Z

I don't actually understand the bazel changes.

Which changes do you mean? Is there something I can explain?

I assume you mean the stuff in //platform. We need a platform and a C++ toolchain for each pool we want to use. The C++ toolchain is 99% of this diff. I copied it from the example repo, which copies whatever Bazel generates when run locally inside the worker's container.

ola-rozenfeld · 2024-09-04T16:29:36Z

I don't actually understand the bazel changes.

Which changes do you mean? Is there something I can explain?

I assume you mean the stuff in //platform. We need a platform and a C++ toolchain for each pool we want to use. The C++ toolchain is 99% of this diff. I copied it from the example repo, which copies whatever Bazel generates when run locally inside the worker's container.

Thank you, that helps! And all our customers do it like this, too, right? I'm wondering if there is any way it can be simplified for them -- something where they need to configure the minimal things for everything to work with their remote cluster, at least for the simpler use-cases...

jayconrod · 2024-09-04T16:50:33Z

There are a couple alternatives, but a user who needs a remote C++ toolchain needs something equivalent of this. Most of the rules depend on C++ in some way.

If the C++ toolchain is part of the remote container image, you need an explicit configuration like this. There's no way to generate that as part of the build because it's needed before the analysis phase. But I think it would be possible to write a tool that executes a remote action then generates the configuration for you.

The main alternative is a hermetic toolchain, where all files are provided as inputs to actions that use the toolchain, and the toolchain configuration is included in the package. That has some overhead, at least in constructing the input tree, but I haven't measured how big the difference is. It's certainly easier to use.

jayconrod added 5 commits August 30, 2024 11:25

revise

299c4fa

missed an opal

bd6aee4

fix copyright check, increase timeouts

5885096

mock username for test

63a1852

log in/out for release

2d67ef4

jayconrod marked this pull request as ready for review August 30, 2024 20:04

jayconrod requested review from minor-fixes and ola-rozenfeld August 30, 2024 20:04

ola-rozenfeld approved these changes Sep 1, 2024

View reviewed changes

minor-fixes approved these changes Sep 3, 2024

View reviewed changes

Merge branch 'main' into jay-ci-re3

ab93824

jayconrod merged commit 4ef5060 into main Sep 4, 2024
5 checks passed

jayconrod deleted the jay-ci-re3 branch September 4, 2024 16:51

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

CUS-349: CI: enable remote execution in jobs that can use it #45

CUS-349: CI: enable remote execution in jobs that can use it #45

jayconrod commented Aug 30, 2024 •

edited

Loading

jayconrod commented Aug 30, 2024 •

edited

Loading

ola-rozenfeld left a comment

ola-rozenfeld Sep 1, 2024

jayconrod Sep 4, 2024

ola-rozenfeld commented Sep 1, 2024

ola-rozenfeld commented Sep 1, 2024

minor-fixes Sep 3, 2024

jayconrod Sep 4, 2024

jayconrod commented Sep 4, 2024

ola-rozenfeld commented Sep 4, 2024

jayconrod commented Sep 4, 2024

CUS-349: CI: enable remote execution in jobs that can use it #45

CUS-349: CI: enable remote execution in jobs that can use it #45

Conversation

jayconrod commented Aug 30, 2024 • edited Loading

jayconrod commented Aug 30, 2024 • edited Loading

ola-rozenfeld left a comment

Choose a reason for hiding this comment

ola-rozenfeld Sep 1, 2024

Choose a reason for hiding this comment

jayconrod Sep 4, 2024

Choose a reason for hiding this comment

ola-rozenfeld commented Sep 1, 2024

ola-rozenfeld commented Sep 1, 2024

minor-fixes Sep 3, 2024

Choose a reason for hiding this comment

jayconrod Sep 4, 2024

Choose a reason for hiding this comment

jayconrod commented Sep 4, 2024

ola-rozenfeld commented Sep 4, 2024

jayconrod commented Sep 4, 2024

jayconrod commented Aug 30, 2024 •

edited

Loading

jayconrod commented Aug 30, 2024 •

edited

Loading