Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

dmz: use overlayfs to write-protect /proc/self/exe if possible #4448

Merged
merged 2 commits into from
Oct 20, 2024

Conversation

cyphar
Copy link
Member

@cyphar cyphar commented Oct 16, 2024

Commit b999376 ("nsenter: cloned_binary: remove bindfd logic
entirely") removed the read-only bind-mount logic from our cloned binary
code because it wasn't really safe because a container with
CAP_SYS_ADMIN could remove the MS_RDONLY bit and get write access to
/proc/self/exe (even with user namespaces this could've been an issue
because it's not clear if the flags are locked).

However, copying a binary does seem to have a minor performance impact.
The only way to have no performance impact would be for the kernel to
block these write attempts, but barring that we could try to reduce the
overhead by coming up with a mount that cannot have it's read-only bits
cleared.

The "simplest" solution is to create a temporary overlayfs using
fsopen(2) which uses the directory where runc exists as a lowerdir,
ensuring that the container cannot access the underlying file -- and we
don't have to do any copies.

While fsopen(2) is not free because mount namespace cloning is usually
expensive (and so it seems like the difference would be marginal), some
basic performance testing seems to indicate there is a ~60% improvement
doing it this way and that it has effectively no overhead even when
compared to just using /proc/self/exe directly:

% hyperfine --warmup 50 \
>           "./runc-noclone run -b bundle ctr" \
>           "./runc-overlayfs run -b bundle ctr" \
>           "./runc-memfd run -b bundle ctr"

Benchmark 1: ./runc-noclone run -b bundle ctr
  Time (mean ± σ):      13.7 ms ±   0.9 ms    [User: 6.0 ms, System: 10.9 ms]
  Range (min … max):    11.3 ms …  16.1 ms    184 runs

Benchmark 2: ./runc-overlayfs run -b bundle ctr
  Time (mean ± σ):      13.9 ms ±   0.9 ms    [User: 6.2 ms, System: 10.8 ms]
  Range (min … max):    11.8 ms …  16.0 ms    180 runs

Benchmark 3: ./runc-memfd run -b bundle ctr
  Time (mean ± σ):      22.6 ms ±   1.3 ms    [User: 5.7 ms, System: 20.7 ms]
  Range (min … max):    19.9 ms …  26.5 ms    114 runs

Summary
  ./runc-noclone run -b bundle ctr ran
    1.01 ± 0.09 times faster than ./runc-overlayfs run -b bundle ctr
    1.65 ± 0.15 times faster than ./runc-memfd run -b bundle ctr

Signed-off-by: Aleksa Sarai [email protected]

@cyphar cyphar force-pushed the cloned-binary-overlayfs branch 3 times, most recently from a16141f to 581d923 Compare October 16, 2024 07:07
@cyphar
Copy link
Member Author

cyphar commented Oct 16, 2024

FWIW, I did some tests and this is about the same performance as 1.1.14, while 1.1.15 and all of 1.2.x have the memfd ~60% performance hit. Seems like this might be the way to go...

logrus.Debugf("using overlayfs for /proc/self/exe sealing")
return overlayFile, nil
}
logrus.Debugf("could not use overlayfs for /proc/self/exe sealing (%v) -- falling back to standard memfd copy", err)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Needs comment lines to compare overlayfs with standard memfd copy (and also with bind-mount)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added a short paragraph or two. Let me know if you think it needs more information.

@cyphar cyphar force-pushed the cloned-binary-overlayfs branch from 581d923 to c23ab45 Compare October 16, 2024 11:24
@cyphar cyphar marked this pull request as ready for review October 16, 2024 13:39
@kolyshkin
Copy link
Contributor

I suggest using a benchmark I added in #4432, it's closer to runc init, so to say.

Copy link
Contributor

@kolyshkin kolyshkin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Have yet to take a closer look, but my biggest concert is having a new mount will trigger systemd which is subscribed to mount/umount events and re-reads /proc/self/mountinfo to sync the internal state of mount units. With many quick runc exec invocations this degrades system performance a lot.

Or is this done in a mount namespace which host systemd can't see?

Comment on lines 97 to 99
// The only reasonable option would be to hash both files and compare them,
// but this would require fully reading both files which would produce a
// similar performance overhead to memfd cloning.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would kcmp(2) work here?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unfortunately no, the two files are completely different from the kernel's perspective (if they were the same we could just check the inode number and device).

@cyphar
Copy link
Member Author

cyphar commented Oct 17, 2024

@kolyshkin The mount is created with fsopen and is never placed on the filesystem, so systemd can't see it (nor can any other process). The main benefit of using fsopen is that we don't need to configure a custom mount namespace because the "new" mount infrastructure creates an anonymous mount namespace internally.

I have a feeling that benchmarking that way would give unrealistically good performance numbers for this approach, and comparing a full container creation/shutdown with the actual runc binary is more indicative of the impact on a real workload. But I'll run the benchmarks and post them here.

@cyphar
Copy link
Member Author

cyphar commented Oct 17, 2024

Using #4432, here are the performance numbers for my machine:

baseline (no /proc/self/exe protection)

goos: linux
goarch: amd64
pkg: github.com/opencontainers/runc/libcontainer/integration
cpu: AMD Ryzen 7 7840U w/ Radeon  780M Graphics
BenchmarkExecTrue
BenchmarkExecTrue-16                1430           4196755 ns/op
BenchmarkExecTrue-16                1430           4198300 ns/op
BenchmarkExecTrue-16                1416           4205281 ns/op
BenchmarkExecTrue-16                1408           4212323 ns/op
BenchmarkExecTrue-16                1432           4217341 ns/op
PASS

memfd

!runc_nodmz

goos: linux
goarch: amd64
pkg: github.com/opencontainers/runc/libcontainer/integration
cpu: AMD Ryzen 7 7840U w/ Radeon  780M Graphics
BenchmarkExecTrue
BenchmarkExecTrue-16                 460          12835217 ns/op
BenchmarkExecTrue-16                 470          12892909 ns/op
BenchmarkExecTrue-16                 464          12985902 ns/op
BenchmarkExecTrue-16                 469          12892199 ns/op
BenchmarkExecTrue-16                 464          13099146 ns/op
PASS

runc_nodmz

goos: linux
goarch: amd64
pkg: github.com/opencontainers/runc/libcontainer/integration
cpu: AMD Ryzen 7 7840U w/ Radeon  780M Graphics
BenchmarkExecTrue
BenchmarkExecTrue-16                 458          12993715 ns/op
BenchmarkExecTrue-16                 463          13001347 ns/op
BenchmarkExecTrue-16                 463          12999835 ns/op
BenchmarkExecTrue-16                 468          13006262 ns/op
BenchmarkExecTrue-16                 463          12969887 ns/op
PASS

runc-dmz

goos: linux
goarch: amd64
pkg: github.com/opencontainers/runc/libcontainer/integration
cpu: AMD Ryzen 7 7840U w/ Radeon  780M Graphics
BenchmarkExecTrue
BenchmarkExecTrue-16                1393           4247870 ns/op
BenchmarkExecTrue-16                1404           4271359 ns/op
BenchmarkExecTrue-16                1406           4274319 ns/op
BenchmarkExecTrue-16                1414           4250608 ns/op
BenchmarkExecTrue-16                1392           4227626 ns/op
PASS

memfd-bind

goos: linux
goarch: amd64
pkg: github.com/opencontainers/runc/libcontainer/integration
cpu: AMD Ryzen 7 7840U w/ Radeon  780M Graphics
BenchmarkExecTrue
BenchmarkExecTrue-16                1441           4192545 ns/op
BenchmarkExecTrue-16                1423           4238741 ns/op
BenchmarkExecTrue-16                1418           4201058 ns/op
BenchmarkExecTrue-16                1405           4216821 ns/op
BenchmarkExecTrue-16                1438           4207067 ns/op
PASS

overlayfs

goos: linux
goarch: amd64
pkg: github.com/opencontainers/runc/libcontainer/integration
cpu: AMD Ryzen 7 7840U w/ Radeon  780M Graphics
BenchmarkExecTrue
BenchmarkExecTrue-16                1374           4349093 ns/op
BenchmarkExecTrue-16                1372           4363986 ns/op
BenchmarkExecTrue-16                1354           4351277 ns/op
BenchmarkExecTrue-16                1386           4382160 ns/op
BenchmarkExecTrue-16                1336           4368590 ns/op
PASS

The key takeaways are that overlayfs has ~2% overhead (with some error bars) compared to memfd-bind or runc-dmz but without the downsides (memfd-bind has administrative complexity and limits what users can run the runc binary, and runc-dmz has all of the fun capability issues). memfd-bind and runc-dmz are have ~1% overhead compared to doing nothing. That being said, on my machine the overhead of copying appears to be >300% if you use this benchmark, which seems a little suspect.

My previous testing seems to indicate that all of these single-digit-percentage overheads basically become noise when you actually run runc as a binary.

@cyphar cyphar force-pushed the cloned-binary-overlayfs branch 2 times, most recently from fb01a0c to d3ea5aa Compare October 17, 2024 09:17
// effectively no performance overhead (it is on par with both
// MS_BIND+MS_RDONLY and no binary cloning at all) while memfd copying adds
// around ~60% overhead during container startup.
overlayFile, err := sealedOverlayfs("/proc/self/exe", tmpDir)
Copy link
Member

@rata rata Oct 17, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This doesn't need the kernel locks that affected clusters with churn using the bind mount logic before? I mean, that was causing this issue: #2532

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That particular issue was because of systemd's mount tracking, which doesn't apply here (the mounts are internal to the process -- systemd will never see them).

However, to be honest I expected it to have an impact, but it doesn't have a noticeable one. Looking into it, the main reason is that CLONE_NEWNS actually requires taking namespace_lock() to make a copy of the existing mount table, but the anonymous allocation of a private mount namespace doesn't and so there is no lock contention on global locks. So there is no locking issue.

It's a bit hard to compare the performance of bind-mount because the code was removed in 1.2.x, so I'd need to write a new version, but if you compare the performance with 1.1.14 this patch is ~5% faster (using the hyperfine benchmark). I can try to benchmark against a synthetic version of bindfd on 1.2.x.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nah, I don't think we need it in 1.2 just to compare. IMHO it is more interesting to create a way to repro #2532 and make sure this doesn't cause similar consequences.

Copy link
Member Author

@cyphar cyphar Oct 17, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We probably can't reproduce that exact issue, but we could have a daemon in our tests that tracks whether there are any new mounts created during testing (which is what that issue is really about) and error out if runc ever creates mounts on the host.

Unfortunately, there is still no notification API for mounts (we discussed designing one based on fanotify at LSF/MM this year but that won't be in a kernel we can use in CI for a while) so we would need to either loop over /proc/self/mountinfo or preferably use listmount(2) to just get a list of mount IDs once CI supports it (Linux 6.8). We might not be able to catch races though, and I can't see an obvious way of avoiding that (mount propagation could be used to copy the mount elsewhere but MS_SLAVE wouldn't stop the umount from propagating). We might need to run runc several hundred times to try to catch a race.

A slightly crazy idea would be to use bpftrace to detect it, but that would be far too brittle for CI.

Copy link
Member

@rata rata Oct 17, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks. Yeah, I think CI for this sounds complicated, at least for now. I'll check if I can repro locally and then give this a try, this sounds enough for now. I'll update if I manage to do it :)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unfortunately, there is still no notification API for mounts

The one we have (and which systemd is using) is epoll on /proc/self/mountinfo.

Caveats:

  • still have to read the file (or use listmount(2)) once an event is received;
  • easy to miss a short-lived mount.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, right! In that case we can just listen for any epoll event and error out if we see one. If we assume the system is not doing other mounts on the host during tests (a reasonable assumption) we should not see any mount events at all during the test.

@rata
Copy link
Member

rata commented Oct 17, 2024

A completely different idea that I want to experiment with is using landlock (although not all kernels support it, it is supported for a while already). Landlock does work with magic-links too, which is great.

I'm not sure if it would be easy to integrate, though, for this protection. But probably if we can make it work, it would be a very light protection (when the kernel supports it).

@cyphar
Copy link
Member Author

cyphar commented Oct 17, 2024

I'm not sure landlock can protect against every possible way of accessing /proc/self/exe. Unless I'm missing something you would need to restrict /proc access in pretty serious ways for the entire container (in ways that would break container programs that use /proc/self/fd for re-opening or open /proc/self/exe). Landlock does have the ability to track some restrictions on file descriptors but not the ones we need (and I'm not entirely sure it would be as fool-proof as overlayfs).

Copy link
Contributor

@kolyshkin kolyshkin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. This is nice 👍🏻

Copy link
Member

@rata rata left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, nice idea! The code LGTM and the tests I did seem to work just fine too :)

I did some stress-tests for possible regressions on #2532. This is what I used to test:

  1. configure containerd to use systemd cgroups
  2. start a k8s cluster
  3. create a pod with 100 replicas and readinessProbe that exec's into the container, every 1 second

I did this with runc 1.1.14, 1.1.15 and runc main with this PR applied.

runc 1.1.15 and runc main showed only containerd/kubelet with high CPU usage (20-40%), after the containers were created udisksd is not using significant CPU (during creating it uses more CPU, but we are doing mounts for the new containers there) and I don't see any mount things in "journalctl -f".

With runc 1.1.14, the udisks process is taking ~40%CPU, with peaks ~55%, constantly (after the containers are running too), the kubelet/containerd take 20/40%CPU and I see entries in the journalctl -f output as follows, even after all containers are running:

Oct 18 14:15:50 lindsay systemd[1]: run-containerd-runc-k8s.io-b1deab2f119eccb4447dd9afbca48c615f7a24246e76840725cb3b7e895b6c5e-runc.zSElNB.mount: Deactivated successfully.

Therefore, I can't notice any regressions here for that issue. This is great :)

click to see the pod.yaml and containerd config

pod.yaml:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: php-apache
  namespace: default
spec:
  replicas: 100
  selector:
    matchLabels:
      run: php-apache
  template:
    metadata:
      labels:
        run: php-apache
    spec:
      containers:
      - image: registry.k8s.io/hpa-example
        imagePullPolicy: Always
        livenessProbe:
          exec:
            command:
            - curl
            - http://localhost:80
          failureThreshold: 1
          periodSeconds: 1
          successThreshold: 1
          timeoutSeconds: 10
        name: php-apache
        ports:
        - containerPort: 80
          protocol: TCP
        readinessProbe:
          exec:
            command:
            - curl
            - http://localhost:80
          failureThreshold: 3
          periodSeconds: 10
          successThreshold: 1
          timeoutSeconds: 10
        resources:
          limits:
            cpu: 500m
          requests:
            cpu: 30m

containerd config:

version = 3
root = '/var/lib/containerd-rata'
state = '/run/containerd-rata'
temp = ''
plugin_dir = ''
disabled_plugins = []
required_plugins = []
oom_score = 0
imports = []

[grpc]
  address = '/run/containerd-rata/containerd.sock'
  tcp_address = ''
  tcp_tls_ca = ''
  tcp_tls_cert = ''
  tcp_tls_key = ''
  uid = 0
  gid = 0
  max_recv_message_size = 16777216
  max_send_message_size = 16777216

[ttrpc]
  address = ''
  uid = 0
  gid = 0

[debug]
  address = ''
  uid = 0
  gid = 0
  level = ''
  format = ''

[metrics]
  address = ''
  grpc_histogram = false

[plugins]
  [plugins.'io.containerd.cri.v1.images']
    snapshotter = 'overlayfs'
    disable_snapshot_annotations = true
    discard_unpacked_layers = false
    max_concurrent_downloads = 3
    image_pull_progress_timeout = '5m0s'
    image_pull_with_sync_fs = false
    stats_collect_period = 10

    [plugins.'io.containerd.cri.v1.images'.pinned_images]
      sandbox = 'registry.k8s.io/pause:3.10'

    [plugins.'io.containerd.cri.v1.images'.registry]
      config_path = ''

    [plugins.'io.containerd.cri.v1.images'.image_decryption]
      key_model = 'node'

  [plugins.'io.containerd.cri.v1.runtime']
    enable_selinux = false
    selinux_category_range = 1024
    max_container_log_line_size = 16384
    disable_apparmor = false
    restrict_oom_score_adj = false
    disable_proc_mount = false
    unset_seccomp_profile = ''
    tolerate_missing_hugetlb_controller = true
    disable_hugetlb_controller = true
    device_ownership_from_security_context = false
    ignore_image_defined_volumes = false
    netns_mounts_under_state_dir = false
    enable_unprivileged_ports = true
    enable_unprivileged_icmp = true
    enable_cdi = true
    cdi_spec_dirs = ['/etc/cdi', '/var/run/cdi']
    drain_exec_sync_io_timeout = '0s'
    ignore_deprecation_warnings = []

    [plugins.'io.containerd.cri.v1.runtime'.containerd]
      default_runtime_name = 'runc'
      ignore_blockio_not_enabled_errors = false
      ignore_rdt_not_enabled_errors = false

      [plugins.'io.containerd.cri.v1.runtime'.containerd.runtimes]
        [plugins.'io.containerd.cri.v1.runtime'.containerd.runtimes.runc]
          runtime_type = 'io.containerd.runc.v2'
          runtime_path = ''
          pod_annotations = []
          container_annotations = []
          privileged_without_host_devices = false
          privileged_without_host_devices_all_devices_allowed = false
          base_runtime_spec = ''
          cni_conf_dir = ''
          cni_max_conf_num = 0
          snapshotter = ''
          sandboxer = 'podsandbox'
          io_type = ''

          [plugins.'io.containerd.cri.v1.runtime'.containerd.runtimes.runc.options]
            BinaryName = ''
            CriuImagePath = ''
            CriuWorkPath = ''
            IoGid = 0
            IoUid = 0
            NoNewKeyring = false
            Root = ''
            ShimCgroup = ''
            SystemdCgroup = true

    [plugins.'io.containerd.cri.v1.runtime'.cni]
      bin_dir = '/opt/cni/bin'
      conf_dir = '/etc/cni/net.d'
      max_conf_num = 1
      setup_serially = false
      conf_template = ''
      ip_pref = ''
      use_internal_loopback = false

  [plugins.'io.containerd.gc.v1.scheduler']
    pause_threshold = 0.02
    deletion_threshold = 0
    mutation_threshold = 100
    schedule_delay = '0s'
    startup_delay = '100ms'

  [plugins.'io.containerd.grpc.v1.cri']
    disable_tcp_service = true
    stream_server_address = '127.0.0.1'
    stream_server_port = '0'
    stream_idle_timeout = '4h0m0s'
    enable_tls_streaming = false

    [plugins.'io.containerd.grpc.v1.cri'.x509_key_pair_streaming]
      tls_cert_file = ''
      tls_key_file = ''

  [plugins.'io.containerd.image-verifier.v1.bindir']
    bin_dir = '/opt/containerd/image-verifier/bin'
    max_verifiers = 10
    per_verifier_timeout = '10s'

  [plugins.'io.containerd.internal.v1.opt']
    path = '/opt/containerd'

  [plugins.'io.containerd.internal.v1.tracing']

  [plugins.'io.containerd.metadata.v1.bolt']
    content_sharing_policy = 'shared'

  [plugins.'io.containerd.monitor.container.v1.restart']
    interval = '10s'

  [plugins.'io.containerd.monitor.task.v1.cgroups']
    no_prometheus = false

  [plugins.'io.containerd.nri.v1.nri']
    disable = false
    socket_path = '/var/run/nri/nri.sock'
    plugin_path = '/opt/nri/plugins'
    plugin_config_path = '/etc/nri/conf.d'
    plugin_registration_timeout = '5s'
    plugin_request_timeout = '2s'
    disable_connections = false

  [plugins.'io.containerd.runtime.v2.task']
    platforms = ['linux/amd64']

  [plugins.'io.containerd.service.v1.diff-service']
    default = ['walking']
    sync_fs = false

  [plugins.'io.containerd.service.v1.tasks-service']
    blockio_config_file = ''
    rdt_config_file = ''

  [plugins.'io.containerd.shim.v1.manager']
    env = []

  [plugins.'io.containerd.snapshotter.v1.blockfile']
    root_path = ''
    scratch_file = ''
    fs_type = ''
    mount_options = []
    recreate_scratch = false

  [plugins.'io.containerd.snapshotter.v1.btrfs']
    root_path = ''

  [plugins.'io.containerd.snapshotter.v1.devmapper']
    root_path = ''
    pool_name = ''
    base_image_size = ''
    async_remove = false
    discard_blocks = false
    fs_type = ''
    fs_options = ''

  [plugins.'io.containerd.snapshotter.v1.native']
    root_path = ''

  [plugins.'io.containerd.snapshotter.v1.overlayfs']
    root_path = ''
    upperdir_label = false
    sync_remove = false
    slow_chown = false
    mount_options = []

  [plugins.'io.containerd.tracing.processor.v1.otlp']

  [plugins.'io.containerd.transfer.v1.local']
    max_concurrent_downloads = 3
    max_concurrent_uploaded_layers = 3
    config_path = ''

[cgroup]
  path = ''

[timeouts]
  'io.containerd.timeout.bolt.open' = '0s'
  'io.containerd.timeout.metrics.shimstats' = '2s'
  'io.containerd.timeout.shim.cleanup' = '5s'
  'io.containerd.timeout.shim.load' = '5s'
  'io.containerd.timeout.shim.shutdown' = '3s'
  'io.containerd.timeout.task.state' = '2s'

[stream_processors]
  [stream_processors.'io.containerd.ocicrypt.decoder.v1.tar']
    accepts = ['application/vnd.oci.image.layer.v1.tar+encrypted']
    returns = 'application/vnd.oci.image.layer.v1.tar'
    path = 'ctd-decoder'
    args = ['--decryption-keys-path', '/etc/containerd/ocicrypt/keys']
    env = ['OCICRYPT_KEYPROVIDER_CONFIG=/etc/containerd/ocicrypt/ocicrypt_keyprovider.conf']

  [stream_processors.'io.containerd.ocicrypt.decoder.v1.tar.gzip']
    accepts = ['application/vnd.oci.image.layer.v1.tar+gzip+encrypted']
    returns = 'application/vnd.oci.image.layer.v1.tar+gzip'
    path = 'ctd-decoder'
    args = ['--decryption-keys-path', '/etc/containerd/ocicrypt/keys']
    env = ['OCICRYPT_KEYPROVIDER_CONFIG=/etc/containerd/ocicrypt/ocicrypt_keyprovider.conf']

@lifubang
Copy link
Member

Need rebase.

@cyphar
Copy link
Member Author

cyphar commented Oct 18, 2024

Btw this approach can also work for rootless containers as well on new enough kernels (overlayfs was enabled inside user namespaces in 5.11), but it would require doing some awful fork+CGo stuff so we can punt on this for now.

@cyphar cyphar force-pushed the cloned-binary-overlayfs branch from d3ea5aa to a2ac644 Compare October 18, 2024 14:36
# use a temporary overlayfs instead of making a memfd clone of
# /proc/self/exe.
[[ "$output" = *"runc-dmz: using overlayfs for sealed /proc/self/exe"* ]]
fi
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This doesn’t seem to work when running as the UID 0 in a userNS with a vanilla kernel 5.1-5.10 that lacks support for mounting overlayfs in a userNS

Copy link
Member Author

@cyphar cyphar Oct 20, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right, but does anyone run the whole test suite inside a userns (maybe rootlesskit?)? I can change this test to check for that, but runc it self will gracefully fall back to doing memfds in that case.

@cyphar cyphar force-pushed the cloned-binary-overlayfs branch 2 times, most recently from cf0baa2 to c098b3e Compare October 20, 2024 08:58
@cyphar cyphar force-pushed the cloned-binary-overlayfs branch from c098b3e to 2e6fc77 Compare October 20, 2024 08:58
Commit b999376 ("nsenter: cloned_binary: remove bindfd logic
entirely") removed the read-only bind-mount logic from our cloned binary
code because it wasn't really safe because a container with
CAP_SYS_ADMIN could remove the MS_RDONLY bit and get write access to
/proc/self/exe (even with user namespaces this could've been an issue
because it's not clear if the flags are locked).

However, copying a binary does seem to have a minor performance impact.
The only way to have no performance impact would be for the kernel to
block these write attempts, but barring that we could try to reduce the
overhead by coming up with a mount that cannot have it's read-only bits
cleared.

The "simplest" solution is to create a temporary overlayfs using
fsopen(2) which uses the directory where runc exists as a lowerdir,
ensuring that the container cannot access the underlying file -- and we
don't have to do any copies.

While fsopen(2) is not free because mount namespace cloning is usually
expensive (and so it seems like the difference would be marginal), some
basic performance testing seems to indicate there is a ~60% improvement
doing it this way and that it has effectively no overhead even when
compared to just using /proc/self/exe directly:

  % hyperfine --warmup 50 \
  >           "./runc-noclone run -b bundle ctr" \
  >           "./runc-overlayfs run -b bundle ctr" \
  >           "./runc-memfd run -b bundle ctr"

  Benchmark 1: ./runc-noclone run -b bundle ctr
    Time (mean ± σ):      13.7 ms ±   0.9 ms    [User: 6.0 ms, System: 10.9 ms]
    Range (min … max):    11.3 ms …  16.1 ms    184 runs

  Benchmark 2: ./runc-overlayfs run -b bundle ctr
    Time (mean ± σ):      13.9 ms ±   0.9 ms    [User: 6.2 ms, System: 10.8 ms]
    Range (min … max):    11.8 ms …  16.0 ms    180 runs

  Benchmark 3: ./runc-memfd run -b bundle ctr
    Time (mean ± σ):      22.6 ms ±   1.3 ms    [User: 5.7 ms, System: 20.7 ms]
    Range (min … max):    19.9 ms …  26.5 ms    114 runs

  Summary
    ./runc-noclone run -b bundle ctr ran
      1.01 ± 0.09 times faster than ./runc-overlayfs run -b bundle ctr
      1.65 ± 0.15 times faster than ./runc-memfd run -b bundle ctr

Signed-off-by: Aleksa Sarai <[email protected]>
@cyphar cyphar force-pushed the cloned-binary-overlayfs branch from 2e6fc77 to 515f09f Compare October 20, 2024 10:35
@@ -396,7 +402,7 @@ function requires() {
fi
;;
root)
if [ $EUID -ne 0 ]; then
if [ $EUID -ne 0 ] || in_userns; then
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, this change probably results in many tests being skipped which were not skipped before.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We don't run the suite in user namespaces, though? But yeah this change wasn't strictly necessary so we can revert it if it's causing issues.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants