Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

CUS-349: CI: enable remote execution in jobs that can use it #45

Merged
merged 7 commits into from
Sep 4, 2024

Conversation

jayconrod
Copy link
Contributor

@jayconrod jayconrod commented Aug 30, 2024

Added platform and C++ toolchain configurations for platforms we use
that have configured remote execution pools. This is mostly copied
from github.com/EngFlow/examples.

In .bazelrc, added platform configurations like remote_linux_x64.
One of these needs to be specified when using remote execution.
Also, rename the opal configuration to engflow and split out engflow_bes.
We'll only support one cluster for this repo.

In workflows, added --config=<remote_platform> and replaced the
CLUSTER_CONFIG environment variable with either engflow or engflow_bes
as appropriate. The jobs driven with 'bazel run' cannot use remote
execution because that always runs on the host, so they're engflow_bes.

Added platform and C++ toolchain configurations for platforms we use
that have configured remote execution pools. This is mostly copied
from github.com/EngFlow/examples.

In .bazelrc, added platform configurations like remote_linux_x64.
One of these needs to be specified when using remote execution.
Also split the opal and opal_bes_only configurations because not all
jobs can use remote execution.

In workflows, added --config=<remote_platform> and replaced the
CLUSTER_CONFIG environment variable with either opal or opal_bes_only
as appropriate. The jobs driven with 'bazel run' cannot use remote
execution because that always runs on the host, so they're opal_bes_only.
@jayconrod
Copy link
Contributor Author

jayconrod commented Aug 30, 2024

Test runs:

@jayconrod jayconrod marked this pull request as ready for review August 30, 2024 20:04
Copy link
Contributor

@ola-rozenfeld ola-rozenfeld left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't actually understand the bazel changes.

# TODO(CUS-349): enable remote execution
build:engflow --config=engflow_common
build:engflow --config=engflow_bes
build:engflow --remote_executor=grpcs://opal.cluster.engflow.com
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note: glass will be faster, because the CI runner is on glass, so it will be faster if the RE action is on the same cluster. Fine to switch in a followup (or create separate configs for glass and opal so that the switch for both CI Runners and RE can be done more easily).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can't use Glass yet, mainly because it's not authorized for open source, though I think we can change that. It also doesn't have a Mac RE pool, though we don't strictly need that.

@ola-rozenfeld
Copy link
Contributor

About that: I was also annoyed that I had to find and comment out the code that does this check when I migrated the job to CI Runners (and it was not trivial). I think this should be easier (for example, an input parameter that needs to be manually checked in order to enable releasing from a dev branch). Everything should be possible from a dev branch, imo, otherwise debugging it is a pain; maybe the outputs of a release from a dev branch should be easily distinguishable from those of master, so that they couldn't be used?

@ola-rozenfeld
Copy link
Contributor

Oh, just saw #46, you're taking care of that -- great!

@@ -44,7 +42,7 @@ jobs:
- "engflow-pool=ci_sysbox_x64"
- "engflow-runtime=sysbox-runc"
- "engflow-runner-id=${{ github.repository_id }}_bazel-builder_${{ github.run_id }}_${{ github.run_number }}_${{ github.run_attempt }}"
timeout-minutes: 10
timeout-minutes: 30
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How is this timeout sized, now that it's extended to 30m?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

10m wasn't fast enough to reliably complete a clean build on Opal. It's usually enough, but not always.

I hit a problem that was kind of a worst case: completely cold cache on a pool that had queueing but hadn't scaled up yet. The GoStdLib actions are particularly slow with a cold cache because they're not optimized for RE, and there are a lot of them because we're cross-compiling to every platform on every build.

@jayconrod
Copy link
Contributor Author

I don't actually understand the bazel changes.

Which changes do you mean? Is there something I can explain?

I assume you mean the stuff in //platform. We need a platform and a C++ toolchain for each pool we want to use. The C++ toolchain is 99% of this diff. I copied it from the example repo, which copies whatever Bazel generates when run locally inside the worker's container.

@ola-rozenfeld
Copy link
Contributor

I don't actually understand the bazel changes.

Which changes do you mean? Is there something I can explain?

I assume you mean the stuff in //platform. We need a platform and a C++ toolchain for each pool we want to use. The C++ toolchain is 99% of this diff. I copied it from the example repo, which copies whatever Bazel generates when run locally inside the worker's container.

Thank you, that helps! And all our customers do it like this, too, right? I'm wondering if there is any way it can be simplified for them -- something where they need to configure the minimal things for everything to work with their remote cluster, at least for the simpler use-cases...

@jayconrod
Copy link
Contributor Author

There are a couple alternatives, but a user who needs a remote C++ toolchain needs something equivalent of this. Most of the rules depend on C++ in some way.

If the C++ toolchain is part of the remote container image, you need an explicit configuration like this. There's no way to generate that as part of the build because it's needed before the analysis phase. But I think it would be possible to write a tool that executes a remote action then generates the configuration for you.

The main alternative is a hermetic toolchain, where all files are provided as inputs to actions that use the toolchain, and the toolchain configuration is included in the package. That has some overhead, at least in constructing the input tree, but I haven't measured how big the difference is. It's certainly easier to use.

@jayconrod jayconrod merged commit 4ef5060 into main Sep 4, 2024
5 checks passed
@jayconrod jayconrod deleted the jay-ci-re3 branch September 4, 2024 16:51
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants