Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Toolset update: VS 2022 17.10 Preview 4, non-spot VMs, Azure Pipelines overhaul #4594

Merged
merged 45 commits into from
Apr 19, 2024

Conversation

StephanTLavavej
Copy link
Member

@StephanTLavavej StephanTLavavej commented Apr 16, 2024

📜 Changelog

  • Infrastructure improvements:
    • Overhauled our Azure Pipelines machinery, introducing "Early Build" stages to quickly find compiler errors when building the STL.
    • Updated dependencies.
      • Updated build compiler to VS 2022 17.10 Preview 4.

📖 Non-Changelog Summary

This is a response to the increasing eviction rate of spot VMs that we've recently been suffering. It changes our pool to regular (non-spot) VMs, which will result in far more reliable PR/CI builds. Regular VMs are ~2.5x more expensive for us than spot VMs, although this will be inherently mitigated by fewer reruns being necessary.

To further mitigate this cost increase, I've dramatically overhauled our Azure Pipelines machinery. Now, in parallel with the Code Format stage, we'll run 4 Early Build stages, verifying that the STL itself builds for each architecture. This also extracts building the STL with /analyze (which is significantly slower than an ordinary build), and building the benchmarks. If any of these builds fail, that indicates a severe problem that needs to be fixed before we run any of the tests. If the Code Format and Early Build stages pass, then we "fan in" and run the x64 build/tests, followed by "fan out" to non-x64 build/tests as usual. The full build/test stages (across 32 shards) are somewhat faster due to the /analyze and benchmark builds being lifted out.

The /analyze build imposes approximately 1 minute of additional cost, so overall we're saving ~32 minutes of compute across the full build/test stages. The Early Builds take about 3 minutes per architecture, so we're paying ~12 minutes there, so there should be a ~20 minute net savings for a fully successful run. In the event that an Early Build finds a problem, we save tons of compute (either 8 full shards for an x64 failure, or the whole 32 shards for a non-x64 failure). Note that failed Code Format validation now spends more compute, but I felt that this was an acceptable tradeoff. The overall critical path is mostly unchanged (we save ~2 minutes by avoiding two /analyze costs, but the ~3 minutes for an Early Build takes longer than the ~1.5 minutes for Code Format).

I had to massively overhaul the machinery to make this restructuring feasible. This overhaul became a top-level improvement by itself (at least for maintainers) - by extracting repetition and eliminating unnecessary logic, I saved ~40 net lines even while adding all of the Early Build logic. The result should be far more maintainable and extensible.

⚙️ Azure Pipelines Commits

  • Behavioral simplification: Drop testParallelism; lit uses all CPUs by default.
  • Change amd64 to x64; VsDevCmd.bat handles them as synonyms.
  • In cross-build.yml, rename vsDevCmdArch to targetArch.
    • This matches how it's being passed to cmake-configure-build.yml and run-tests.yml.
  • In native-build-test.yml, split vsDevCmdArch into hostArch and targetArch.
    • This matches what's being passed to cmake-configure-build.yml and run-tests.yml.
  • Explicitly pass hostArch: x64 to cross-build.yml.
  • Unify away targetPlatform; it was identical to targetArch.
  • Consistently order hostArch before targetArch.
  • Cosmetic change: Unify 'Build Tests' and 'Run Tests' into 'Build and Run Tests'.
    • Spending logic to vary the displayName is unnecessary.
  • Use long-form ctest options for clarity.
  • Rename testSelection to ctestOptions.
    • This clarifies who the ultimate consumer is.
  • Drop unnecessary comments.
  • In cross-build.yml, change hardcoded ctestOptions to a parameter with a default argument.
    • This is more complicated, but it's a step towards unification.
  • In asan-pipeline.yml, don't bother centralizing '--tests-regex stlasan' into a variable.
    • IMO this extra step was making things harder to follow.
  • In cross-build.yml, take asanBuild as a parameter with a default argument.
    • Again, this is a step towards unification. This makes a clarity improvement possible - now we can remove the asanBuild default argument from cmake-configure-build.yml. IMO having defaults at different "levels" was very confusing.
  • Change how cmakeAdditionalFlags is defaulted.
    • Clarity improvement: Don't default it at the cmake-configure-build.yml level.
    • Both native-build-test.yml and cross-build.yml now take it as a parameter, defaulting to empty, and pass it down.
    • Finally, azure-pipelines.yml passes cmakeAdditionalFlags: '-DTESTS_BUILD_ONLY=ON' when performing cross builds.
    • This is the final step towards unification.
  • Unify (identical) native-build-test.yml and cross-build.yml into build-and-test.yml.
    • Saying "and" in the name clarifies that we're doing two separate things.
  • Consistently sort parameters after hostArch, targetArch.
  • Change the buildBenchmarks default from true to false.
  • Consistently capitalize the PowerShell@2 task.
  • Drop unnecessary pwsh: false.
  • Simplify task: CmdLine@2 to the script: shortcut.
  • Add parameter analyzeBuild to control STL_USE_ANALYZE.
    • Originally hardcoded to ON, now defaulted to true. (These are synonyms to CMake.)
  • Change the analyzeBuild default from true to false.
  • Behavior change: Don't enable analyzeBuild in asan-pipeline.yml.
  • Simplify checkout-sources.yml by dropping unnecessary parameters.
    • These parameters (llvmSHAVar etc.) were storing the names of Azure Pipelines variables to create (llvmSHA), confusingly sharing the same names as PowerShell variables ($llvmSHA etc.). This layer of indirection was unnecessary. We can take the names llvmSHAVar etc. and directly use them as the names of Azure Pipelines variables.
  • In checkout-sources.yml, drop unnecessary semicolons when setting Azure Pipelines variables.
  • Extract git submodule status into a loop.
  • Perform regex replacement on a separate line.
    • This simplifies the syntax and avoids packing too much logic into a single line. It also allows us to put the regex directly next to its replacement, clarifying how the capture group is connected.
  • Scripts don't need to cd $(Build.SourcesDirectory) as that's their default workingDirectory.
  • Add doTesting to control whether we checkout LLVM and build/run tests.
    • (I changed my mind and altered this to skipTesting below, but rewriting history was too much effort. Sorry for the extra complexity.)
  • Major behavioral change: Add 'Early Build' stages.
    • Alongside 'Code Format', we begin by fanning out an 'Early Build' for each architecture. This verifies that the STL builds, and also covers /analyze and the benchmarks (the latter is still skipped for plain ARM). It skips all testing, and saves more time by skipping the LLVM checkout. Then we fan in for the normal x64 build and test. This results in several improvements:
      • Any build breaks (whether ordinary, /analyze, or benchmark) are caught early and cheaply. Previously, we'd spend at least 8 shards, or the full 32 for a non-x64 break.
      • This reduces the length of the critical path (or so I thought, but not as much as I'd hoped). We previously paid two /analyze and benchmark builds on the critical path (x64, then non-x64). Now we perform one /analyze and benchmark build simultaneously with 'Code Format' which takes almost as long, so it's nearly free.
      • By removing work from the 32 full test shards (/analyze was expensive although the benchmarks are currently cheap), we're reducing the surface area for eviction, and reducing the amount of work that needs to be rerun after evictions.
    • If 'Code Format' fails, this performs a bit more work than before, but less than 1 full test shard.
  • Cosmetic change: Display 'Build and Test' for ARM/ARM64.
    • We're still compiling the tests, just not running them.
  • Behavior change: Make buildBenchmarks control whether we checkout google-benchmark.
  • git sparse-checkout now defaults to --cone.
  • Make 'Setup TMP Directory' more consistent.
    • Move one occurrence from build-and-test.yml into checkout-sources.yml, so it consistently appears before checkout: self.
    • Change the other occurrence in format-validation.yml to also use a single-line if exist command.
  • Behavior change: Set fetchDepth: 1 and fetchTags: false.
  • Replace general cmakeAdditionalFlags with specific testsBuildOnly.
    • And we don't need to use this when configuring the benchmarks.
  • Changed my mind, use skipTesting with a default of false.
  • Split checkout-sources.yml into checkout-self.yml and checkout-submodules.yml.
    • Now format-validation.yml can use checkout-self.yml, avoiding duplication.
  • Behavioral simplification: Avoid 3x duplication with checkout-submodule.yml.
    • This extracts 3 copy-pasted cmd scripts into 1 PowerShell script.
    • We no longer need to use Azure Pipelines variables to communicate the SHA.
    • Each remote is now named submodule-upstream for uniformity, as the name doesn't matter.
    • I've performed a few additional simplifications that I believe are proper, but we'll need to watch out for problems when agents reuse repos:
      • After top-level self-checkout, each submodule directory should exist, so we shouldn't have to force-create it.
      • If the .git directory doesn't exist in the submodule, we shouldn't need to obliterate all other files there. We're going to perform a checkout and clean that should restore us to a known good state.
      • If the .git directory already exists, running git init again is harmless by design.
      • Instead of the "run git remote get-url and look for failure" technique, we can check the output of git remote to make git remote add idempotent.

🧰 Toolset Update Commits

@StephanTLavavej StephanTLavavej added infrastructure Related to repository automation uncharted Excluded from the Status Chart labels Apr 16, 2024
@StephanTLavavej StephanTLavavej requested a review from a team as a code owner April 16, 2024 00:13
This matches how it's being passed to cmake-configure-build.yml and run-tests.yml.
…argetArch`.

This matches what's being passed to cmake-configure-build.yml and run-tests.yml.
…Run Tests'.

Spending logic to vary the `displayName` is unnecessary,
especially because we already distinguish
'Build and Test x64' vs. 'Build ARM64' at the top level.
This clarifies who the ultimate consumer is.
…th a default argument.

This is more complicated, but it's a step towards unification.
…an'` into a variable.

IMO this extra step was making things harder to follow.
…gument.

Again, this is a step towards unification.

This makes a clarity improvement possible - now we can remove
the `asanBuild` default argument from cmake-configure-build.yml.
IMO having defaults at different "levels" was very confusing.
Clarity improvement: Don't default it at the cmake-configure-build.yml level.

Both native-build-test.yml and cross-build.yml now take it as a parameter, defaulting to empty, and pass it down.

Finally, azure-pipelines.yml passes `cmakeAdditionalFlags: '-DTESTS_BUILD_ONLY=ON'` when performing cross builds.

This is the final step towards unification.
…d-and-test.yml.

Saying "and" in the name clarifies that we're doing two separate things.
Originally hardcoded to `ON`, now defaulted to `true`. (These are synonyms to CMake.)
These parameters (`llvmSHAVar` etc.) were storing the names of Azure Pipelines
variables to create (`llvmSHA`), confusingly sharing the same names as
PowerShell variables (`$llvmSHA` etc.).

This layer of indirection was unnecessary. We can take the names `llvmSHAVar` etc.
and directly use them as the names of Azure Pipelines variables.
Alongside 'Code Format', we begin by fanning out an 'Early Build' for each architecture.
This verifies that the STL builds, and also covers `/analyze` and the benchmarks (the latter is still skipped for plain ARM). It skips all testing, and saves more time by skipping the LLVM checkout.
Then we fan in for the normal x64 build and test.

This results in several improvements:

* Any build breaks (whether ordinary, `/analyze`, or benchmark) are caught early and cheaply. Previously, we'd spend at least 8 shards, or the full 32 for a non-x64 break.
* This reduces the length of the critical path. We previously paid two `/analyze` and benchmark builds on the critical path (x64, then non-x64). Now we perform one `/analyze` and benchmark build simultaneously with 'Code Format' which takes almost as long, so it's nearly free.
* By removing work from the 32 full test shards (`/analyze` was expensive although the benchmarks are currently cheap), we're reducing the surface area for eviction, and reducing the amount of work that needs to be rerun after evictions.

If 'Code Format' fails, this performs a bit more work than before, but less than 1 full test shard.
We're still compiling the tests, just not running them.
Move one occurrence from build-and-test.yml into checkout-sources.yml, so it consistently appears before `checkout: self`.

Change the other occurrence in format-validation.yml to also use a single-line `if exist` command.
These settings observably improve our checkout behavior.
And we don't need to use this when configuring the benchmarks.
…ules.yml.

Now format-validation.yml can use checkout-self.yml, avoiding duplication.
…le.yml.

This extracts 3 copy-pasted cmd scripts into 1 PowerShell script.

We no longer need to use Azure Pipelines variables to communicate the SHA.

Each remote is now named submodule-upstream for uniformity, as the name doesn't matter.

I've performed a few additional simplifications that I believe are proper, but we'll need to watch out for problems when agents reuse repos:

* After top-level self-checkout, each submodule directory should exist, so we shouldn't have to force-create it.
* If the .git directory doesn't exist in the submodule, we shouldn't need to obliterate all other files there. We're going to perform a checkout and clean that should restore us to a known good state.
* If the .git directory already exists, running `git init` again is harmless by design.
* Instead of the "run `git remote get-url` and look for failure" technique, we can check the output of `git remote` to make `git remote add` idempotent.
@StephanTLavavej
Copy link
Member Author

/azp run STL-ASan-CI

This comment was marked as resolved.

@StephanTLavavej StephanTLavavej changed the title Azure Pipelines refactoring WIP Toolset update: VS 2022 17.10 Preview 4, non-spot VMs, Azure Pipelines overhaul Apr 17, 2024
@StephanTLavavej StephanTLavavej removed the uncharted Excluded from the Status Chart label Apr 17, 2024
Copy link
Member

@CaseyCarter CaseyCarter left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This will make it much less painful to work on the pipelines in the future. Thanks!

@CaseyCarter CaseyCarter removed their assignment Apr 18, 2024
@CaseyCarter
Copy link
Member

/azp run STL-ASan-CI

STL-ASan-CI passed.

@StephanTLavavej StephanTLavavej self-assigned this Apr 18, 2024
@StephanTLavavej
Copy link
Member Author

I'm mirroring this to the MSVC-internal repo - please notify me if any further changes are pushed.

@StephanTLavavej StephanTLavavej merged commit a35fb2a into microsoft:main Apr 19, 2024
39 checks passed
@StephanTLavavej StephanTLavavej deleted the azure branch April 19, 2024 00:10
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
infrastructure Related to repository automation
Projects
Archived in project
Development

Successfully merging this pull request may close these issues.

2 participants