Add docker builder prune to environment hook #1251

Open
wants to merge 1 commit into
base: main

jim-barber-he

When the environment hook detects that there is not enough disk space on the agent it invokes docker image prune.

On our agents with 64 GiB disks we are finding that the disks fill up too quickly, and even when we set DOCKER_PRUNE_UNTIL=30m not enough is cleaned up.

The packer/linux/conf/docker/scripts/docker-low-disk-gc script also has similar logic, except in addition to docker image prune it runs docker builder prune.

On one of our agents that had started failing builds due to full disk and where the environment hook's disk clean up was not freeing up enough disk space, I manually ran the docker builder prune command with appropriate command line arguments and it freed up 20 GB of disk for us. So it seems that it would be beneficial to run this in the environment hook as well.

@moskyb
Contributor

moskyb commented Nov 1, 2023

hi there @jim-barber-he! i get where you're coming from here - the lifecycle of docker images and builders on the elastic stack can sometimes build up, and it's a less-than-awesome user experience.

unfortunately, in this case, i don't think that this solution will be sufficient; if builder prunes run while a job is running a docker build, we've found cases where it'll cause those image builds to fail - see #662, so applying this logic to all elastic stack users would probably cause some problems for other users.

if you're running across this regularly, i'd recommend doing one (or more) of the following:

  • Launch instances with larger disks
  • Have shorter instance lifetimes - elastic stack instances cycle out after 10 minutes of not running a job, but you could make this shorter and/or increase the number of instances to increase the likelihood of an instance timing itself out
  • Add this logic to a hook for your instances only - see https://buildkite.com/docs/agent/v3/hooks
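
The third option could be sketched roughly as follows. This is a minimal, hypothetical agent-level hook, not the stack's own `environment` hook: the function names, the `MIN_FREE_KB` threshold, and the `DOCKER_DATA_DIR` path are assumptions made for illustration. It only uses `docker builder prune` with its `--all`, `--force`, and `--filter until=...` options. Because pruning during a concurrent `docker build` may break that build (see #662), this is probably only safe where a single agent runs per instance.

```shell
#!/usr/bin/env bash
# Hypothetical agent-level hook (for example `environment` or `pre-command`).
# All names and thresholds below are assumptions, not part of the Elastic CI Stack.

# Minimum free space, in KiB, before the build cache is pruned (assumed: 5 GiB).
MIN_FREE_KB="${MIN_FREE_KB:-5242880}"
# Only prune cache entries not used within this window, mirroring DOCKER_PRUNE_UNTIL.
PRUNE_UNTIL="${DOCKER_PRUNE_UNTIL:-30m}"
# Path on the filesystem that holds Docker's data (assumed default location).
DOCKER_DATA_DIR="${DOCKER_DATA_DIR:-/var/lib/docker}"

# Print the free space, in KiB, of the filesystem containing path $1.
free_kb() {
  df -Pk "$1" | awk 'NR==2 {print $4}'
}

# Succeed when free space $1 is below threshold $2 (both in KiB).
needs_prune() {
  [ "$1" -lt "$2" ]
}

# Remove build cache not used within PRUNE_UNTIL, without prompting.
prune_build_cache() {
  docker builder prune --all --force --filter "until=${PRUNE_UNTIL}"
}

# Prune the build cache only when the Docker filesystem is low on space.
low_disk_builder_prune() {
  local free
  free="$(free_kb "$DOCKER_DATA_DIR")"
  if [ -n "$free" ] && needs_prune "$free" "$MIN_FREE_KB"; then
    echo "Free space ${free} KiB is below ${MIN_FREE_KB} KiB; pruning Docker build cache"
    prune_build_cache || echo "docker builder prune failed; continuing" >&2
  fi
}

# The agent sets BUILDKITE_JOB_ID for every job, so this only acts inside a job.
if [ -n "${BUILDKITE_JOB_ID:-}" ]; then
  low_disk_builder_prune
fi
```

The script deliberately avoids `set -e`, since agent hooks are sourced and a failed prune should not abort the job.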

@jim-barber-he
Author

Is the issue you linked applicable to Docker BuildKit?
I thought that was something BuildKit was supposed to solve that the old docker builder didn't handle.
When elastic-ci-stack-for-aws switched to version 6, builds seemed to change to use Docker BuildKit by default.

In our case our disks are growing by around 2 GB every 10 minutes (when the docker image prune finds nothing to clean up).
For some reason this became more of a problem when we moved to v6 of the elastic-ci-stack-for-aws, so I don't know if that's because BuildKit does things differently from how the old docker builds did.

We use RAM disks to speed up our builds, so moving to larger instance types is cost prohibitive.

In our case shorter lifetimes probably won't help because we pretty much have pipelines going all throughout the work day.
We have agents that are often up for hours and we haven't changed the defaults for the lifetimes.

I've hacked this into our agents and it has saved us from having major problems, however I used a sed hack via the bootstrap script to modify your hook which is probably brittle so I'll look into using the job hooks instead.

@DrJosh9000
Contributor

Turning this on for everyone is risky. On the other hand, we'll probably never know who it would break without doing that 🤔

Perhaps the right solution is to tune the builder GC parameters in the Docker daemon configuration: https://docs.docker.com/build/cache/garbage-collection/

From a quick look, the defaults seem reasonable for a disk cache, but could potentially fill a RAM disk quickly (maybe Docker is making assumptions about where the cache is stored, and computes an inappropriate Keep Bytes?)
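
As a rough illustration of that idea, the sketch below prints a `daemon.json` fragment that enables builder garbage collection with an explicit cache cap. The `builder.gc.enabled` and `defaultKeepStorage` keys are taken from the Docker garbage-collection documentation linked above as it stood around the time of this thread; newer Docker releases may name these settings differently, so treat the keys as an assumption to verify. The `builder_gc_fragment` helper and the `10GB` value are hypothetical, and merging the fragment into an existing `/etc/docker/daemon.json` and restarting the daemon are left out.

```shell
#!/usr/bin/env bash
# Print a daemon.json fragment that caps the BuildKit build cache.
# $1: amount of cache to keep (assumed example value; defaults to 10GB).
builder_gc_fragment() {
  local keep="${1:-10GB}"
  cat <<EOF
{
  "builder": {
    "gc": {
      "enabled": true,
      "defaultKeepStorage": "${keep}"
    }
  }
}
EOF
}

# Example: emit a fragment with a 10 GB cap for review before merging it by hand.
builder_gc_fragment "10GB"
```

A smaller cap would likely suit agents that keep Docker data on a RAM disk, at the cost of fewer cache hits.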

3 participants