-
Notifications
You must be signed in to change notification settings - Fork 0
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
CUS-349: CI: enable remote execution in jobs that can use it #45
Conversation
Added platform and C++ toolchain configurations for platforms we use that have configured remote execution pools. This is mostly copied from github.com/EngFlow/examples. In .bazelrc, added platform configurations like remote_linux_x64. One of these needs to be specified when using remote execution. Also split the opal and opal_bes_only configurations because not all jobs can use remote execution. In workflows, added --config=<remote_platform> and replaced the CLUSTER_CONFIG environment variable with either opal or opal_bes_only as appropriate. The jobs driven with 'bazel run' cannot use remote execution because that always runs on the host, so they're opal_bes_only.
Test runs:
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't actually understand the bazel changes.
# TODO(CUS-349): enable remote execution | ||
build:engflow --config=engflow_common | ||
build:engflow --config=engflow_bes | ||
build:engflow --remote_executor=grpcs://opal.cluster.engflow.com |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Note: glass will be faster, because the CI runner is on glass, so it will be faster if the RE action is on the same cluster. Fine to switch in a followup (or create separate configs for glass and opal so that the switch for both CI Runners and RE can be done more easily).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can't use Glass yet, mainly because it's not authorized for open source, though I think we can change that. It also doesn't have a Mac RE pool, though we don't strictly need that.
About that: I was also annoyed that I had to find and comment out the code that does this check when I migrated the job to CI Runners (and it was not trivial). I think this should be easier (for example, an input parameter that needs to be manually checked in order to enable releasing from a dev branch). Everything should be possible from a dev branch, imo, otherwise debugging it is a pain; maybe the outputs of a release from a dev branch should be easily distinguishable from those of master, so that they couldn't be used? |
Oh, just saw #46, you're taking care of that -- great! |
@@ -44,7 +42,7 @@ jobs: | |||
- "engflow-pool=ci_sysbox_x64" | |||
- "engflow-runtime=sysbox-runc" | |||
- "engflow-runner-id=${{ github.repository_id }}_bazel-builder_${{ github.run_id }}_${{ github.run_number }}_${{ github.run_attempt }}" | |||
timeout-minutes: 10 | |||
timeout-minutes: 30 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How is this timeout sized, now that it's extended to 30m?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
10m wasn't fast enough to reliably complete a clean build on Opal. It's usually enough, but not always.
I hit a problem that was kind of a worst case: completely cold cache on a pool that had queueing but hadn't scaled up yet. The GoStdLib
actions are particularly slow with a cold cache because they're not optimized for RE, and there are a lot of them because we're cross-compiling to every platform on every build.
Which changes do you mean? Is there something I can explain? I assume you mean the stuff in |
Thank you, that helps! And all our customers do it like this, too, right? I'm wondering if there is any way it can be simplified for them -- something where they need to configure the minimal things for everything to work with their remote cluster, at least for the simpler use-cases... |
There are a couple alternatives, but a user who needs a remote C++ toolchain needs something equivalent of this. Most of the rules depend on C++ in some way. If the C++ toolchain is part of the remote container image, you need an explicit configuration like this. There's no way to generate that as part of the build because it's needed before the analysis phase. But I think it would be possible to write a tool that executes a remote action then generates the configuration for you. The main alternative is a hermetic toolchain, where all files are provided as inputs to actions that use the toolchain, and the toolchain configuration is included in the package. That has some overhead, at least in constructing the input tree, but I haven't measured how big the difference is. It's certainly easier to use. |
Added platform and C++ toolchain configurations for platforms we use
that have configured remote execution pools. This is mostly copied
from github.com/EngFlow/examples.
In .bazelrc, added platform configurations like remote_linux_x64.
One of these needs to be specified when using remote execution.
Also, rename the opal configuration to engflow and split out engflow_bes.
We'll only support one cluster for this repo.
In workflows, added --config=<remote_platform> and replaced the
CLUSTER_CONFIG environment variable with either engflow or engflow_bes
as appropriate. The jobs driven with 'bazel run' cannot use remote
execution because that always runs on the host, so they're engflow_bes.