RFC: consider read only sysctl errors as non fatal #825

Open
Luap99 opened this issue Oct 17, 2023 · 13 comments · May be fixed by #910
Comments

@Luap99
Member

Luap99 commented Oct 17, 2023

When running inside unprivileged containers, /proc is normally mounted read-only.
If a user tries to run netavark there, it fails hard when it cannot set all of its sysctls. Most of them are needed for routing or to disable some IPv6 options, but general communication may still be possible without them.

We should consider not treating read-only errors as fatal and just log them as warnings. The biggest problem is likely the ip_forward sysctl: without it, no external communication is possible. However, it could already be set by the outer container manager, in which case I would expect things to mostly work fine.
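
The proposal can be sketched roughly as follows. This is an illustrative Python sketch, not netavark's actual (Rust) implementation: the function name `write_sysctl` and the injectable `writer` parameter are hypothetical and exist only to make the behaviour easy to try out. The idea is that only a read-only filesystem error (EROFS, "os error 30") is downgraded to a warning, while every other error stays fatal.

```python
import errno
import logging

log = logging.getLogger("sysctl-sketch")


def write_sysctl(path, value, writer=None):
    """Try to write a sysctl file.

    Returns True if the write succeeded and False if it was skipped
    because the filesystem is read-only. Any other error is re-raised.
    `writer` is a hypothetical hook so the logic can be exercised
    without touching /proc.
    """
    def default_writer(p, v):
        with open(p, "w") as f:
            f.write(v)

    writer = writer or default_writer
    try:
        writer(path, value)
        return True
    except OSError as err:
        if err.errno == errno.EROFS:
            # Read-only /proc: warn and carry on instead of aborting.
            log.warning("cannot set %s to %s: %s", path, value, err)
            return False
        # Permission problems, missing files, etc. remain fatal.
        raise
```

Whether continuing is safe still depends on the value that is already in place; a skipped write to ip_forward while it is 0 would leave routing broken, which is the concern raised above.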

see containers/podman#19991

@dnewhook

Did this PR try to fix this issue? #825

@Luap99
Member Author

Luap99 commented Nov 30, 2023

What PR? You link to this issue.

@dnewhook

Sorry I meant this PR: #333 (Support read only /proc) for this issue: #330

@Luap99
Member Author

Luap99 commented Nov 30, 2023

No, that PR only works if you already have the right sysctl values. This issue is about the case where the right sysctl values are not set and we simply ignore the failure when we cannot set them. But that most likely means routing is nonfunctional, so I am not sure this is a good idea.

@dnewhook

Thanks for the info. The right sysctls are the ones mentioned in the following issue, right? #362

--kubelet-extra-args '--allowed-unsafe-sysctls="net.ipv4.conf.default.route_localnet"'

@Omar007

Omar007 commented Feb 14, 2024

I ran into this today setting up a rootless and unprivileged podman deployment inside of a k8s cluster. Here is the PodSpec for reference:

containers:
  - image: quay.io/podman/stable:v4.9.0
    name: podman
    command:
      - podman
      - system
      - service
      - --log-level
      - debug
      - --transient-store
      - --time
      - "0"
      - tcp://localhost:2375
    securityContext:
      runAsUser: 1000
      runAsGroup: 1000
    resources:
      limits:
        squat.ai/fuse: 1
        squat.ai/tun: 1

I would've expected it to only set hard-required sysctls and ignore/not write any that already have the correct value or are optional but it seems it just tries to set them unconditionally causing problems for this setup.

I had tried to set the following sysctls using the PodSpec's securityContext.sysctls field as needed (from what I can tell) to no avail. I also attempted a run with IPv6 disabled (network_cmd_options=["enable_ipv6=false"]) which took care of some of the sysctl writes (maybe from slirp4netns though, not netavark).

sysctl                                  value  note
net.ipv4.ip_forward                     1
net.ipv4.conf.default.arp_notify        1
net.ipv6.conf.[default|eth0].autoconf   0
net.ipv6.conf.default.accept_dad        0      Only when enable_ipv6=true? Logged as a warning
net.ipv6.conf.default.accept_ra         0      Only when enable_ipv6=true?
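
To compare the table against what a pod actually sees, a small read-only check can help. The sketch below is only an illustration: the `EXPECTED` mapping copies the `default` rows of the table above, and the helper names `sysctl_path` and `check` are made up for this example. It assumes the usual mapping of dotted sysctl names to files under /proc/sys and does not handle interface names that themselves contain dots.

```python
from pathlib import Path

# Values taken from the table above ("default" entries only).
EXPECTED = {
    "net.ipv4.ip_forward": "1",
    "net.ipv4.conf.default.arp_notify": "1",
    "net.ipv6.conf.default.autoconf": "0",
    "net.ipv6.conf.default.accept_dad": "0",
    "net.ipv6.conf.default.accept_ra": "0",
}


def sysctl_path(name, root="/proc/sys"):
    """Map a dotted sysctl name to its file under `root`."""
    return Path(root, *name.split("."))


def check(expected=EXPECTED, root="/proc/sys"):
    """Return {name: (wanted, actual, matches)} without writing anything.

    `actual` is None when the file cannot be read.
    """
    report = {}
    for name, wanted in expected.items():
        try:
            actual = sysctl_path(name, root).read_text().strip()
        except OSError:
            actual = None
        report[name] = (wanted, actual, actual == wanted)
    return report
```

Calling `check()` inside the pod and printing the entries whose last field is False shows which values differ from the table. Per-interface entries such as eth0 belong to the network namespace of the inner container, so they would have to be checked from there.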

Checking from within the running pods, the sysctls have the values as set on the PodSpec and match the value netavark would write (which it still did even though it's already correct).
Logs show, for instance, [DEBUG netavark::network::core_utils] Setting sysctl value for net.ipv4.ip_forward to 1 (it already is) and then it fails with time="2024-02-14T17:10:53Z" level=info msg="Request Failed(Internal Server Error): netavark (exit code 1): Sysctl error: IO Error: Read-only file system (os error 30)"

@Luap99
Member Author

Luap99 commented Feb 14, 2024

> I would've expected it to only set hard-required sysctls and ignore/not write any that already have the correct value or are optional but it seems it just tries to set them unconditionally causing problems for this setup.

We already first read the value and then only set it if it does not have the correct value.
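
The read-then-write behaviour described here can be illustrated with a minimal sketch. It is written in Python for brevity and is not netavark's code; `ensure_sysctl` is a hypothetical name. The point it shows is that a write, and therefore a possible read-only error, only happens when the current value differs from the wanted one.

```python
from pathlib import Path


def ensure_sysctl(path, wanted):
    """Write `wanted` to `path` only if the current value differs.

    Returns "unchanged" when no write was needed and "written" otherwise.
    A failing write (for example on a read-only /proc) raises OSError.
    """
    path = Path(path)
    current = path.read_text().strip()
    if current == wanted:
        # Value is already correct, so the file is never opened for writing.
        return "unchanged"
    path.write_text(wanted)
    return "written"
```

Under this logic a read-only /proc is harmless as long as every value already matches, so a read-only failure suggests that at least one sysctl differed from what was expected.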

@Omar007

Omar007 commented Feb 14, 2024

Are any sysctls missing from the table above, or listed with incorrect values, that it doesn't log about? If it skips the write when a sysctl already has the expected value, I would not expect it to fail for being unable to write (assuming it really does not write anything in that case) 🤔

@algompluecker

I am stuck with the same issue. Any idea how to resolve this? Is there maybe another place where it writes to /proc?
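
One way to narrow this down is to list which mounts at or below /proc are read-only in the environment where netavark runs. The sketch below parses text in the format of /proc/self/mounts (device, mount point, filesystem type, options, ...). The function name `readonly_proc_mounts` is hypothetical, and the sketch only reports mount options; it does not show where netavark writes.

```python
from typing import List


def readonly_proc_mounts(mounts_text: str) -> List[str]:
    """Return mount points at or below /proc that carry the "ro" option."""
    result = []
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 4:
            continue  # skip malformed or empty lines
        mount_point = fields[1]
        options = fields[3].split(",")
        under_proc = mount_point == "/proc" or mount_point.startswith("/proc/")
        if under_proc and "ro" in options:
            result.append(mount_point)
    return result
```

It could be used as `readonly_proc_mounts(open("/proc/self/mounts").read())`. If /proc/sys shows up in the result, any attempt to change a sysctl from that environment would be expected to fail with a read-only file system error.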

@Omar007

Omar007 commented Jun 7, 2024

I recently took another shot at this with podman 5, but sadly nothing has changed on my end. There's no documentation on which sysctl values are expected to be set or will be attempted, which capabilities or filesystem access are needed, nothing.

The only information[1][2][3][4] I've found thus far suggests there's no need for it to be privileged, no need for (NET_ADMIN) capabilities, no need to set sysctls if they are already set correctly[5][6], etc.

Running podman without any networking seems to suggest this might actually be true but the moment networking is involved, it all falls apart.

That means one or more of the following: it simply can't work rootless or without special privileges/capabilities yet once networking is involved (doubtful, since everyone involved presents it as working); assumptions are being made about the underlying systems/runtimes (e.g. device access, a minimum set of capabilities, ...); and/or the documentation is incorrect or missing information (quite likely).
For instance, a quick test dropping ALL capabilities shows that podman needs at least SETGID and SETUID regardless.

It also turns out we can request a trace log level, but the output did not contain anything more that I could use to diagnose what is going on:

$  podman run docker.io/library/busybox:latest echo 123
123
$  podman run --log-level trace --network podman docker.io/library/busybox:latest echo 123
INFO[0000] podman filtering at log level trace
DEBU[0000] Called run.PersistentPreRunE(podman run --log-level trace --network podman docker.io/library/busybox:latest echo 123)
DEBU[0000] Using conmon: "/usr/bin/conmon"
INFO[0000] Using sqlite as database backend
DEBU[0000] systemd-logind: dial unix /var/run/dbus/system_bus_socket: connect: no such file or directory
DEBU[0000] Using graph driver overlay
DEBU[0000] Using graph root /home/podman/.local/share/containers/storage
DEBU[0000] Using run root /tmp/storage-run-1000/containers
DEBU[0000] Using static dir /home/podman/.local/share/containers/storage/libpod
DEBU[0000] Using tmp dir /tmp/storage-run-1000/libpod/tmp
DEBU[0000] Using volume path /home/podman/.local/share/containers/storage/volumes
DEBU[0000] Using transient store: false
DEBU[0000] Not configuring container store
DEBU[0000] Initializing event backend file
DEBU[0000] Configured OCI runtime runj initialization failed: no valid executable found for OCI runtime runj: invalid argument
DEBU[0000] Configured OCI runtime kata initialization failed: no valid executable found for OCI runtime kata: invalid argument
DEBU[0000] Configured OCI runtime runsc initialization failed: no valid executable found for OCI runtime runsc: invalid argument
DEBU[0000] Configured OCI runtime youki initialization failed: no valid executable found for OCI runtime youki: invalid argument
DEBU[0000] Configured OCI runtime krun initialization failed: no valid executable found for OCI runtime krun: invalid argument
DEBU[0000] Configured OCI runtime ocijail initialization failed: no valid executable found for OCI runtime ocijail: invalid argument
TRAC[0000] found runtime "/usr/bin/crun"
DEBU[0000] Configured OCI runtime runc initialization failed: no valid executable found for OCI runtime runc: invalid argument
DEBU[0000] Configured OCI runtime crun-vm initialization failed: no valid executable found for OCI runtime crun-vm: invalid argument
DEBU[0000] Configured OCI runtime crun-wasm initialization failed: no valid executable found for OCI runtime crun-wasm: invalid argument
DEBU[0000] Using OCI runtime "/usr/bin/crun"
INFO[0000] Setting parallel job count to 193
DEBU[0000] Could not move to subcgroup: mkdir /sys/fs/cgroup/init: read-only file system
INFO[0000] podman filtering at log level trace
DEBU[0000] Called run.PersistentPreRunE(podman run --log-level trace --network podman docker.io/library/busybox:latest echo 123)
DEBU[0000] Using conmon: "/usr/bin/conmon"
INFO[0000] Using sqlite as database backend
DEBU[0000] systemd-logind: dial unix /var/run/dbus/system_bus_socket: connect: no such file or directory
DEBU[0000] Using graph driver overlay
DEBU[0000] Using graph root /home/podman/.local/share/containers/storage
DEBU[0000] Using run root /tmp/storage-run-1000/containers
DEBU[0000] Using static dir /home/podman/.local/share/containers/storage/libpod
DEBU[0000] Using tmp dir /tmp/storage-run-1000/libpod/tmp
DEBU[0000] Using volume path /home/podman/.local/share/containers/storage/volumes
DEBU[0000] Using transient store: false
DEBU[0000] [graphdriver] trying provided driver "overlay"
DEBU[0000] overlay: storage already configured with a mount-program
DEBU[0000] backingFs=overlayfs, projectQuotaSupported=false, useNativeDiff=false, usingMetacopy=false
DEBU[0000] Initializing event backend file
DEBU[0000] Configured OCI runtime crun-vm initialization failed: no valid executable found for OCI runtime crun-vm: invalid argument
DEBU[0000] Configured OCI runtime runc initialization failed: no valid executable found for OCI runtime runc: invalid argument
DEBU[0000] Configured OCI runtime runj initialization failed: no valid executable found for OCI runtime runj: invalid argument
DEBU[0000] Configured OCI runtime runsc initialization failed: no valid executable found for OCI runtime runsc: invalid argument
DEBU[0000] Configured OCI runtime youki initialization failed: no valid executable found for OCI runtime youki: invalid argument
TRAC[0000] found runtime "/usr/bin/crun"
DEBU[0000] Configured OCI runtime crun-wasm initialization failed: no valid executable found for OCI runtime crun-wasm: invalid argument
DEBU[0000] Configured OCI runtime kata initialization failed: no valid executable found for OCI runtime kata: invalid argument
DEBU[0000] Configured OCI runtime krun initialization failed: no valid executable found for OCI runtime krun: invalid argument
DEBU[0000] Configured OCI runtime ocijail initialization failed: no valid executable found for OCI runtime ocijail: invalid argument
DEBU[0000] Using OCI runtime "/usr/bin/crun"
INFO[0000] Setting parallel job count to 193
DEBU[0000] Could not move to subcgroup: mkdir /sys/fs/cgroup/init: read-only file system
DEBU[0000] Pulling image docker.io/library/busybox:latest (policy: missing)
DEBU[0000] Looking up image "docker.io/library/busybox:latest" in local containers storage
DEBU[0000] Normalized platform linux/amd64 to {amd64 linux  [] }
DEBU[0000] Trying "docker.io/library/busybox:latest" ...
DEBU[0000] parsed reference into "[overlay@/home/podman/.local/share/containers/storage+/tmp/storage-run-1000/containers]@65ad0d468eb1c558bf7f4e64e790f586e9eda649ee9f130cd0e835b292bbc5ac"
DEBU[0000] Found image "docker.io/library/busybox:latest" as "docker.io/library/busybox:latest" in local containers storage
DEBU[0000] Found image "docker.io/library/busybox:latest" as "docker.io/library/busybox:latest" in local containers storage ([overlay@/home/podman/.local/share/containers/storage+/tmp/storage-run-1000/containers]@65ad0d468eb1c558bf7f4e64e790f586e9eda649ee9f130cd0e835b292bbc5ac)
DEBU[0000] exporting opaque data as blob "sha256:65ad0d468eb1c558bf7f4e64e790f586e9eda649ee9f130cd0e835b292bbc5ac"
DEBU[0000] Looking up image "docker.io/library/busybox:latest" in local containers storage
DEBU[0000] Normalized platform linux/amd64 to {amd64 linux  [] }
DEBU[0000] Trying "docker.io/library/busybox:latest" ...
DEBU[0000] parsed reference into "[overlay@/home/podman/.local/share/containers/storage+/tmp/storage-run-1000/containers]@65ad0d468eb1c558bf7f4e64e790f586e9eda649ee9f130cd0e835b292bbc5ac"
DEBU[0000] Found image "docker.io/library/busybox:latest" as "docker.io/library/busybox:latest" in local containers storage
DEBU[0000] Found image "docker.io/library/busybox:latest" as "docker.io/library/busybox:latest" in local containers storage ([overlay@/home/podman/.local/share/containers/storage+/tmp/storage-run-1000/containers]@65ad0d468eb1c558bf7f4e64e790f586e9eda649ee9f130cd0e835b292bbc5ac)
DEBU[0000] exporting opaque data as blob "sha256:65ad0d468eb1c558bf7f4e64e790f586e9eda649ee9f130cd0e835b292bbc5ac"
DEBU[0000] User mount /proc:/proc options []
DEBU[0000] Looking up image "docker.io/library/busybox:latest" in local containers storage
DEBU[0000] Normalized platform linux/amd64 to {amd64 linux  [] }
DEBU[0000] Trying "docker.io/library/busybox:latest" ...
DEBU[0000] parsed reference into "[overlay@/home/podman/.local/share/containers/storage+/tmp/storage-run-1000/containers]@65ad0d468eb1c558bf7f4e64e790f586e9eda649ee9f130cd0e835b292bbc5ac"
DEBU[0000] Found image "docker.io/library/busybox:latest" as "docker.io/library/busybox:latest" in local containers storage
DEBU[0000] Found image "docker.io/library/busybox:latest" as "docker.io/library/busybox:latest" in local containers storage ([overlay@/home/podman/.local/share/containers/storage+/tmp/storage-run-1000/containers]@65ad0d468eb1c558bf7f4e64e790f586e9eda649ee9f130cd0e835b292bbc5ac)
DEBU[0000] exporting opaque data as blob "sha256:65ad0d468eb1c558bf7f4e64e790f586e9eda649ee9f130cd0e835b292bbc5ac"
DEBU[0000] Inspecting image 65ad0d468eb1c558bf7f4e64e790f586e9eda649ee9f130cd0e835b292bbc5ac
DEBU[0000] exporting opaque data as blob "sha256:65ad0d468eb1c558bf7f4e64e790f586e9eda649ee9f130cd0e835b292bbc5ac"
DEBU[0000] Inspecting image 65ad0d468eb1c558bf7f4e64e790f586e9eda649ee9f130cd0e835b292bbc5ac
DEBU[0000] Inspecting image 65ad0d468eb1c558bf7f4e64e790f586e9eda649ee9f130cd0e835b292bbc5ac
DEBU[0000] Inspecting image 65ad0d468eb1c558bf7f4e64e790f586e9eda649ee9f130cd0e835b292bbc5ac
DEBU[0000] User mount /proc:/proc options []
DEBU[0000] using systemd mode: false
DEBU[0000] Loading seccomp profile from "/usr/share/containers/seccomp.json"
DEBU[0000] Adding mount /dev
DEBU[0000] Adding mount /dev/pts
DEBU[0000] Adding mount /sys
DEBU[0000] Adding mount /dev/mqueue
DEBU[0000] Adding mount /sys/fs/cgroup
DEBU[0000] Successfully loaded 1 networks
DEBU[0000] Allocated lock 6 for container 7713c1bd0dcf26b7359a107e344b7f4f8564fb9c27d3f958a6f51c18707453ce
DEBU[0000] exporting opaque data as blob "sha256:65ad0d468eb1c558bf7f4e64e790f586e9eda649ee9f130cd0e835b292bbc5ac"
DEBU[0000] Created container "7713c1bd0dcf26b7359a107e344b7f4f8564fb9c27d3f958a6f51c18707453ce"
DEBU[0000] Container "7713c1bd0dcf26b7359a107e344b7f4f8564fb9c27d3f958a6f51c18707453ce" has work directory "/home/podman/.local/share/containers/storage/overlay-containers/7713c1bd0dcf26b7359a107e344b7f4f8564fb9c27d3f958a6f51c18707453ce/userdata"
DEBU[0000] Container "7713c1bd0dcf26b7359a107e344b7f4f8564fb9c27d3f958a6f51c18707453ce" has run directory "/tmp/storage-run-1000/containers/overlay-containers/7713c1bd0dcf26b7359a107e344b7f4f8564fb9c27d3f958a6f51c18707453ce/userdata"
DEBU[0000] Not attaching to stdin
INFO[0000] Received shutdown.Stop(), terminating!        PID=515
DEBU[0000] Enabling signal proxying
DEBU[0000] overlay: mount_data=lowerdir=/home/podman/.local/share/containers/storage/overlay/l/F2QTOXZ7YEPRV74Z6VWO6OUOQI,upperdir=/home/podman/.local/share/containers/storage/overlay/6305a866e2f9015f41a9c1396df7b64f4ca8c01b66761985c947843066b56124/diff,workdir=/home/podman/.local/share/containers/storage/overlay/6305a866e2f9015f41a9c1396df7b64f4ca8c01b66761985c947843066b56124/work
DEBU[0000] Made network namespace at /tmp/storage-run-1000/netns/netns-894216f4-72d2-876f-07ba-9d4d7ee50dfe for container 7713c1bd0dcf26b7359a107e344b7f4f8564fb9c27d3f958a6f51c18707453ce
DEBU[0000] Mounted container "7713c1bd0dcf26b7359a107e344b7f4f8564fb9c27d3f958a6f51c18707453ce" at "/home/podman/.local/share/containers/storage/overlay/6305a866e2f9015f41a9c1396df7b64f4ca8c01b66761985c947843066b56124/merged"
DEBU[0000] Created root filesystem for container 7713c1bd0dcf26b7359a107e344b7f4f8564fb9c27d3f958a6f51c18707453ce at /home/podman/.local/share/containers/storage/overlay/6305a866e2f9015f41a9c1396df7b64f4ca8c01b66761985c947843066b56124/merged
TRAC[0000] netavark command: printf '{"container_id":"7713c1bd0dcf26b7359a107e344b7f4f8564fb9c27d3f958a6f51c18707453ce","container_name":"laughing_borg","networks":{"podman":{"static_ips":["10.88.0.8"],"aliases":["7713c1bd0dcf"],"interface_name":"eth0"}},"network_info":{"podman":{"name":"podman","id":"2f259bab93aaaaa2542ba43ef33eb990d0999ee1b9924b557b7be53c0b7a1bb9","driver":"bridge","network_interface":"podman0","created":"2024-06-07T16:57:43.104440601Z","subnets":[{"subnet":"10.88.0.0/16","gateway":"10.88.0.1"}],"ipv6_enabled":false,"internal":false,"dns_enabled":false,"ipam_options":{"driver":"host-local"}}}}' | /usr/libexec/podman/netavark setup /tmp/storage-run-1000/netns/netns-894216f4-72d2-876f-07ba-9d4d7ee50dfe
DEBU[0000] Creating rootless network namespace at "/tmp/storage-run-1000/containers/networks/rootless-netns/rootless-netns"
DEBU[0000] pasta arguments: --config-net --pid /tmp/storage-run-1000/containers/networks/rootless-netns/rootless-netns-conn.pid --dns-forward 169.254.0.1 -t none -u none -T none -U none --no-map-gw --quiet --netns /tmp/storage-run-1000/containers/networks/rootless-netns/rootless-netns
DEBU[0000] The path of /etc/resolv.conf in the mount ns is "/etc/resolv.conf"
[DEBUG netavark::network::validation] "Validating network namespace..."
[DEBUG netavark::commands::setup] "Setting up..."
[DEBUG netavark::firewall] Forcibly using firewall driver nftables
[INFO  netavark::firewall] Using nftables firewall driver
[TRACE netavark::network::netlink] send netlink packet: NetlinkMessage { header: NetlinkHeader { length: 32, message_type: 19, flags: 1541, sequence_number: 1, port_number: 0 }, payload: InnerMessage(SetLink(LinkMessage { header: LinkHeader { interface_family: Unspec, index: 1, link_layer_type: Netrom, flags: [Up], change_mask: [Up] }, attributes: [] })) }
[TRACE netavark::network::netlink] read netlink packet: NetlinkMessage { header: NetlinkHeader { length: 36, message_type: 2, flags: 256, sequence_number: 1, port_number: 546 }, payload: Error(ErrorMessage { code: None, header: [32, 0, 0, 0, 19, 0, 5, 6, 1, 0, 0, 0, 0, 0, 0, 0] }) }
[DEBUG netavark::network::bridge] Setup network podman
[DEBUG netavark::network::bridge] Container interface name: eth0 with IP addresses [10.88.0.8/16]
[DEBUG netavark::network::bridge] Bridge name: podman0 with IP addresses [10.88.0.1/16]
[DEBUG netavark::network::core_utils] Setting sysctl value for net.ipv4.ip_forward to 1
[TRACE netavark::network::netlink] send netlink packet: NetlinkMessage { header: NetlinkHeader { length: 44, message_type: 18, flags: 1, sequence_number: 1, port_number: 0 }, payload: InnerMessage(GetLink(LinkMessage { header: LinkHeader { interface_family: Unspec, index: 0, link_layer_type: Netrom, flags: [], change_mask: [] }, attributes: [IfName("podman0")] })) }
[TRACE netavark::network::netlink] read netlink packet: NetlinkMessage { header: NetlinkHeader { length: 64, message_type: 2, flags: 0, sequence_number: 1, port_number: 546 }, payload: Error(ErrorMessage { code: Some(-19), header: [44, 0, 0, 0, 18, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 12, 0, 3, 0, 112, 111, 100, 109, 97, 110, 48, 0] }) }
[TRACE netavark::network::netlink] send netlink packet: NetlinkMessage { header: NetlinkHeader { length: 60, message_type: 16, flags: 1541, sequence_number: 2, port_number: 0 }, payload: InnerMessage(NewLink(LinkMessage { header: LinkHeader { interface_family: Unspec, index: 0, link_layer_type: Netrom, flags: [], change_mask: [] }, attributes: [LinkInfo([Kind(Bridge)]), IfName("podman0")] })) }
[TRACE netavark::network::netlink] read netlink packet: NetlinkMessage { header: NetlinkHeader { length: 36, message_type: 2, flags: 256, sequence_number: 2, port_number: 546 }, payload: Error(ErrorMessage { code: None, header: [60, 0, 0, 0, 16, 0, 5, 6, 2, 0, 0, 0, 0, 0, 0, 0] }) }
[TRACE netavark::network::netlink] send netlink packet: NetlinkMessage { header: NetlinkHeader { length: 44, message_type: 18, flags: 1, sequence_number: 3, port_number: 0 }, payload: InnerMessage(GetLink(LinkMessage { header: LinkHeader { interface_family: Unspec, index: 0, link_layer_type: Netrom, flags: [], change_mask: [] }, attributes: [IfName("podman0")] })) }
[TRACE netavark::network::netlink] read netlink packet: NetlinkMessage { header: NetlinkHeader { length: 1880, message_type: 16, flags: 0, sequence_number: 3, port_number: 546 }, payload: InnerMessage(NewLink(LinkMessage { header: LinkHeader { interface_family: Unspec, index: 3, link_layer_type: Ether, flags: [Broadcast, Multicast], change_mask: [] }, attributes: [IfName("podman0"), TxQueueLen(1000), OperState(Down), Mode(0), Mtu(1500), MinMtu(68), MaxMtu(65535), Group(0), Promiscuity(0), Other(DefaultNla { kind: 61, value: [0, 0, 0, 0] }), NumTxQueues(1), GsoMaxSegs(65535), GsoMaxSize(65536), Other(DefaultNla { kind: 58, value: [0, 0, 1, 0] }), Other(DefaultNla { kind: 63, value: [0, 0, 1, 0] }), Other(DefaultNla { kind: 64, value: [0, 0, 1, 0] }), Other(DefaultNla { kind: 59, value: [0, 0, 1, 0] }), Other(DefaultNla { kind: 60, value: [255, 255, 0, 0] }), NumRxQueues(1), Carrier(1), Qdisc("noop"), CarrierChanges(0), CarrierUpCount(0), CarrierDownCount(0), ProtoDown(0), Map(Map { memory_start: 0, memory_end: 0, base_address: 0, irq: 0, dma: 0, port: 0 }), Address([186, 217, 241, 171, 224, 26]), Broadcast([255, 255, 255, 255, 255, 255]), Stats64(Stats64 { rx_packets: 0, tx_packets: 0, rx_bytes: 0, tx_bytes: 0, rx_errors: 0, tx_errors: 0, rx_dropped: 0, tx_dropped: 0, multicast: 0, collisions: 0, rx_length_errors: 0, rx_over_errors: 0, rx_crc_errors: 0, rx_frame_errors: 0, rx_fifo_errors: 0, rx_missed_errors: 0, tx_aborted_errors: 0, tx_carrier_errors: 0, tx_fifo_errors: 0, tx_heartbeat_errors: 0, tx_window_errors: 0, rx_compressed: 0, tx_compressed: 0, rx_nohandler: 0, rx_otherhost_dropped: 0 }), Stats(Stats { rx_packets: 0, tx_packets: 0, rx_bytes: 0, tx_bytes: 0, rx_errors: 0, tx_errors: 0, rx_dropped: 0, tx_dropped: 0, multicast: 0, collisions: 0, rx_length_errors: 0, rx_over_errors: 0, rx_crc_errors: 0, rx_frame_errors: 0, rx_fifo_errors: 0, rx_missed_errors: 0, tx_aborted_errors: 0, tx_carrier_errors: 0, tx_fifo_errors: 0, tx_heartbeat_errors: 0, 
tx_window_errors: 0, rx_compressed: 0, tx_compressed: 0, rx_nohandler: 0 }), Xdp([Attached(None)]), LinkInfo([Kind(Bridge), Data(Bridge([HelloTimer(0), TcnTimer(0), TopologyChangeTimer(0), GcTimer(0), ForwardDelay(1499), HelloTime(199), MaxAge(1999), AgeingTime(29999), StpState(0), Priority(32768), VlanFiltering(0), GroupFwdMask(0), BridgeId((128, [0, 0, 0, 0, 0, 0])), RootId((128, [0, 0, 0, 0, 0, 0])), RootPort(0), RootPathCost(0), TopologyChange(0), TopologyChangeDetected(0), GroupAddr([1, 128, 194, 0, 0, 0]), MultiBoolOpt(30064771072), Other(DefaultNla { kind: 48, value: [0, 0, 0, 0] }), Other(DefaultNla { kind: 49, value: [0, 0, 0, 0] }), VlanProtocol(33024), VlanDefaultPvid(1), VlanStatsEnabled(0), VlanStatsPerHost(0), MulticastRouter(1), MulticastSnooping(1), MulticastQueryUseIfaddr(0), MulticastQuerier(0), MulticastStatsEnabled(0), MulticastHashElasticity(16), MulticastHashMax(4096), MulticastLastMemberCount(2), MulticastStartupQueryCount(2), MulticastIgmpVersion(2), MulticastMldVersion(1), MulticastLastMemberInterval(99), MulticastMembershipInterval(25999), MulticastQuerierInterval(25499), MulticastQueryInterval(12499), MulticastQueryResponseInterval(999), MulticastStartupQueryInterval(3124), NfCallIpTables(0), NfCallIp6Tables(0), NfCallArpTables(0)]))]), AfSpecUnspec([Inet([DevConf(InetDevConf { forwarding: 1, mc_forwarding: 0, proxy_arp: 0, accept_redirects: 1, secure_redirects: 1, send_redirects: 1, shared_media: 1, rp_filter: 2, accept_source_route: 0, bootp_relay: 0, log_martians: 0, tag: 0, arpfilter: 0, medium_id: 0, noxfrm: 0, nopolicy: 0, force_igmp_version: 0, arp_announce: 0, arp_ignore: 0, promote_secondaries: 1, arp_accept: 0, arp_notify: 0, accept_local: 0, src_vmark: 0, proxy_arp_pvlan: 0, route_localnet: 0, igmpv2_unsolicited_report_interval: 10000, igmpv3_unsolicited_report_interval: 1000, ignore_routes_with_linkdown: 0, drop_unicast_in_l2_multicast: 0, drop_gratuitous_arp: 0, bc_forwarding: 0, arp_evict_nocarrier: 1 })]), 
Inet6([Flags(Inet6IfaceFlags([])), CacheInfo(Inet6CacheInfo { max_reasm_len: 65535, tstamp: 209915975, reachable_time: 25374, retrans_time: 1000 }), DevConf(Inet6DevConf { forwarding: 0, hoplimit: 64, mtu6: 1500, accept_ra: 1, accept_redirects: 1, autoconf: 1, dad_transmits: 1, rtr_solicits: -1, rtr_solicit_interval: 4000, rtr_solicit_delay: 1000, use_tempaddr: 0, temp_valid_lft: 604800, temp_prefered_lft: 86400, regen_max_retry: 3, max_desync_factor: 600, max_addresses: 16, force_mld_version: 0, accept_ra_defrtr: 1, accept_ra_pinfo: 1, accept_ra_rtr_pref: 1, rtr_probe_interval: 60000, accept_ra_rt_info_max_plen: 0, proxy_ndp: 0, optimistic_dad: 0, accept_source_route: 0, mc_forwarding: 0, disable_ipv6: 0, accept_dad: 1, force_tllao: 0, ndisc_notify: 0, mldv1_unsolicited_report_interval: 10000, mldv2_unsolicited_report_interval: 1000, suppress_frag_ndisc: 1, accept_ra_from_local: 0, use_optimistic: 0, accept_ra_mtu: 1, stable_secret: 0, use_oif_addrs_only: 0, accept_ra_min_hop_limit: 1, ignore_routes_with_linkdown: 0, drop_unicast_in_l2_multicast: 0, drop_unsolicited_na: 0, keep_addr_on_down: 0, rtr_solicit_max_interval: 3600000, seg6_enabled: 0, seg6_require_hmac: 0, enhanced_dad: 1, addr_gen_mode: 0, disable_policy: 0, accept_ra_rt_info_min_plen: 0, ndisc_tclass: 0, rpl_seg_enabled: 0, ra_defrtr_metric: 1024, ioam6_enabled: 0, ioam6_id: 65535, ioam6_id_wide: -1, ndisc_evict_nocarrier: 1, accept_untracked_na: 0, accept_ra_min_lft: 0 }), Stats(Inet6Stats { num: 38, in_pkts: 0, in_octets: 0, in_delivers: 0, out_forw_datagrams: 0, out_pkts: 0, out_octets: 0, in_hdr_errors: 0, in_too_big_errors: 0, in_no_routes: 0, in_addr_errors: 0, in_unknown_protos: 0, in_truncated_pkts: 0, in_discards: 0, out_discards: 0, out_no_routes: 0, reasm_timeout: 0, reasm_reqds: 0, reasm_oks: 0, reasm_fails: 0, frag_oks: 0, frag_fails: 0, frag_creates: 0, in_mcast_pkts: 0, out_mcast_pkts: 0, in_bcast_pkts: 0, out_bcast_pkts: 0, in_mcast_octets: 0, out_mcast_octets: 0, in_bcast_octets: 0, 
out_bcast_octets: 0, in_csum_errors: 0, in_no_ect_pkts: 0, in_ect1_pkts: 0, in_ect0_pkts: 0, in_ce_pkts: 0 }), Icmp6Stats(Icmp6Stats { num: 7, in_msgs: 0, in_errors: 0, out_msgs: 0, out_errors: 0, csum_errors: 0 }), Token(::), AddrGenMode(0)])]), Other(DefaultNla { kind: 32830, value: [] }), Other(DefaultNla { kind: 32833, value: [] })] })) }
[TRACE netavark::network::netlink] send netlink packet: NetlinkMessage { header: NetlinkHeader { length: 40, message_type: 20, flags: 1541, sequence_number: 4, port_number: 0 }, payload: InnerMessage(NewAddress(AddressMessage { header: AddressHeader { family: Inet, prefix_len: 16, flags: [], scope: Universe, index: 3 }, attributes: [Broadcast(10.88.255.255), Local(10.88.0.1)] })) }
[TRACE netavark::network::netlink] read netlink packet: NetlinkMessage { header: NetlinkHeader { length: 36, message_type: 2, flags: 256, sequence_number: 4, port_number: 546 }, payload: Error(ErrorMessage { code: None, header: [40, 0, 0, 0, 20, 0, 5, 6, 4, 0, 0, 0, 0, 0, 0, 0] }) }
[TRACE netavark::network::netlink] send netlink packet: NetlinkMessage { header: NetlinkHeader { length: 32, message_type: 19, flags: 1541, sequence_number: 5, port_number: 0 }, payload: InnerMessage(SetLink(LinkMessage { header: LinkHeader { interface_family: Unspec, index: 3, link_layer_type: Netrom, flags: [Up], change_mask: [Up] }, attributes: [] })) }
[TRACE netavark::network::netlink] read netlink packet: NetlinkMessage { header: NetlinkHeader { length: 36, message_type: 2, flags: 256, sequence_number: 5, port_number: 546 }, payload: Error(ErrorMessage { code: None, header: [32, 0, 0, 0, 19, 0, 5, 6, 5, 0, 0, 0, 0, 0, 0, 0] }) }
[TRACE netavark::network::netlink] send netlink packet: NetlinkMessage { header: NetlinkHeader { length: 116, message_type: 16, flags: 1541, sequence_number: 6, port_number: 0 }, payload: InnerMessage(NewLink(LinkMessage { header: LinkHeader { interface_family: Unspec, index: 0, link_layer_type: Netrom, flags: [], change_mask: [] }, attributes: [LinkInfo([Kind(Veth), Data(Veth(Peer(LinkMessage { header: LinkHeader { interface_family: Unspec, index: 0, link_layer_type: Netrom, flags: [], change_mask: [] }, attributes: [LinkInfo([Kind(Veth)]), IfName("eth0"), NetNsFd(3)] })))]), Controller(3)] })) }
[TRACE netavark::network::netlink] read netlink packet: NetlinkMessage { header: NetlinkHeader { length: 36, message_type: 2, flags: 256, sequence_number: 6, port_number: 546 }, payload: Error(ErrorMessage { code: None, header: [116, 0, 0, 0, 16, 0, 5, 6, 6, 0, 0, 0, 0, 0, 0, 0] }) }
[TRACE netavark::network::netlink] send netlink packet: NetlinkMessage { header: NetlinkHeader { length: 44, message_type: 18, flags: 1, sequence_number: 2, port_number: 0 }, payload: InnerMessage(GetLink(LinkMessage { header: LinkHeader { interface_family: Unspec, index: 0, link_layer_type: Netrom, flags: [], change_mask: [] }, attributes: [IfName("eth0")] })) }
[TRACE netavark::network::netlink] read netlink packet: NetlinkMessage { header: NetlinkHeader { length: 1468, message_type: 16, flags: 0, sequence_number: 2, port_number: 546 }, payload: InnerMessage(NewLink(LinkMessage { header: LinkHeader { interface_family: Unspec, index: 2, link_layer_type: Ether, flags: [Broadcast, Multicast], change_mask: [] }, attributes: [IfName("eth0"), TxQueueLen(1000), OperState(Down), Mode(0), Mtu(1500), MinMtu(68), MaxMtu(65535), Group(0), Promiscuity(0), Other(DefaultNla { kind: 61, value: [0, 0, 0, 0] }), NumTxQueues(64), GsoMaxSegs(65535), GsoMaxSize(65536), Other(DefaultNla { kind: 58, value: [0, 0, 1, 0] }), Other(DefaultNla { kind: 63, value: [0, 0, 1, 0] }), Other(DefaultNla { kind: 64, value: [0, 0, 1, 0] }), Other(DefaultNla { kind: 59, value: [248, 255, 7, 0] }), Other(DefaultNla { kind: 60, value: [255, 255, 0, 0] }), NumRxQueues(64), Carrier(0), Qdisc("noop"), CarrierChanges(1), CarrierUpCount(0), CarrierDownCount(1), ProtoDown(0), Map(Map { memory_start: 0, memory_end: 0, base_address: 0, irq: 0, dma: 0, port: 0 }), Address([70, 53, 181, 98, 53, 139]), Broadcast([255, 255, 255, 255, 255, 255]), Stats64(Stats64 { rx_packets: 0, tx_packets: 0, rx_bytes: 0, tx_bytes: 0, rx_errors: 0, tx_errors: 0, rx_dropped: 0, tx_dropped: 0, multicast: 0, collisions: 0, rx_length_errors: 0, rx_over_errors: 0, rx_crc_errors: 0, rx_frame_errors: 0, rx_fifo_errors: 0, rx_missed_errors: 0, tx_aborted_errors: 0, tx_carrier_errors: 0, tx_fifo_errors: 0, tx_heartbeat_errors: 0, tx_window_errors: 0, rx_compressed: 0, tx_compressed: 0, rx_nohandler: 0, rx_otherhost_dropped: 0 }), Stats(Stats { rx_packets: 0, tx_packets: 0, rx_bytes: 0, tx_bytes: 0, rx_errors: 0, tx_errors: 0, rx_dropped: 0, tx_dropped: 0, multicast: 0, collisions: 0, rx_length_errors: 0, rx_over_errors: 0, rx_crc_errors: 0, rx_frame_errors: 0, rx_fifo_errors: 0, rx_missed_errors: 0, tx_aborted_errors: 0, tx_carrier_errors: 0, tx_fifo_errors: 0, tx_heartbeat_errors: 0, 
tx_window_errors: 0, rx_compressed: 0, tx_compressed: 0, rx_nohandler: 0 }), Xdp([Attached(None)]), LinkInfo([Kind(Veth)]), NetnsId(0), Link(4), AfSpecUnspec([Inet([DevConf(InetDevConf { forwarding: 1, mc_forwarding: 0, proxy_arp: 0, accept_redirects: 1, secure_redirects: 1, send_redirects: 1, shared_media: 1, rp_filter: 2, accept_source_route: 0, bootp_relay: 0, log_martians: 0, tag: 0, arpfilter: 0, medium_id: 0, noxfrm: 0, nopolicy: 0, force_igmp_version: 0, arp_announce: 0, arp_ignore: 0, promote_secondaries: 1, arp_accept: 0, arp_notify: 0, accept_local: 0, src_vmark: 0, proxy_arp_pvlan: 0, route_localnet: 0, igmpv2_unsolicited_report_interval: 10000, igmpv3_unsolicited_report_interval: 1000, ignore_routes_with_linkdown: 0, drop_unicast_in_l2_multicast: 0, drop_gratuitous_arp: 0, bc_forwarding: 0, arp_evict_nocarrier: 1 })]), Inet6([Flags(Inet6IfaceFlags([])), CacheInfo(Inet6CacheInfo { max_reasm_len: 65535, tstamp: 209915976, reachable_time: 25047, retrans_time: 1000 }), DevConf(Inet6DevConf { forwarding: 0, hoplimit: 64, mtu6: 1500, accept_ra: 1, accept_redirects: 1, autoconf: 1, dad_transmits: 1, rtr_solicits: -1, rtr_solicit_interval: 4000, rtr_solicit_delay: 1000, use_tempaddr: 0, temp_valid_lft: 604800, temp_prefered_lft: 86400, regen_max_retry: 3, max_desync_factor: 600, max_addresses: 16, force_mld_version: 0, accept_ra_defrtr: 1, accept_ra_pinfo: 1, accept_ra_rtr_pref: 1, rtr_probe_interval: 60000, accept_ra_rt_info_max_plen: 0, proxy_ndp: 0, optimistic_dad: 0, accept_source_route: 0, mc_forwarding: 0, disable_ipv6: 0, accept_dad: 1, force_tllao: 0, ndisc_notify: 0, mldv1_unsolicited_report_interval: 10000, mldv2_unsolicited_report_interval: 1000, suppress_frag_ndisc: 1, accept_ra_from_local: 0, use_optimistic: 0, accept_ra_mtu: 1, stable_secret: 0, use_oif_addrs_only: 0, accept_ra_min_hop_limit: 1, ignore_routes_with_linkdown: 0, drop_unicast_in_l2_multicast: 0, drop_unsolicited_na: 0, keep_addr_on_down: 0, rtr_solicit_max_interval: 3600000, 
seg6_enabled: 0, seg6_require_hmac: 0, enhanced_dad: 1, addr_gen_mode: 0, disable_policy: 0, accept_ra_rt_info_min_plen: 0, ndisc_tclass: 0, rpl_seg_enabled: 0, ra_defrtr_metric: 1024, ioam6_enabled: 0, ioam6_id: 65535, ioam6_id_wide: -1, ndisc_evict_nocarrier: 1, accept_untracked_na: 0, accept_ra_min_lft: 0 }), Stats(Inet6Stats { num: 38, in_pkts: 0, in_octets: 0, in_delivers: 0, out_forw_datagrams: 0, out_pkts: 0, out_octets: 0, in_hdr_errors: 0, in_too_big_errors: 0, in_no_routes: 0, in_addr_errors: 0, in_unknown_protos: 0, in_truncated_pkts: 0, in_discards: 0, out_discards: 0, out_no_routes: 0, reasm_timeout: 0, reasm_reqds: 0, reasm_oks: 0, reasm_fails: 0, frag_oks: 0, frag_fails: 0, frag_creates: 0, in_mcast_pkts: 0, out_mcast_pkts: 0, in_bcast_pkts: 0, out_bcast_pkts: 0, in_mcast_octets: 0, out_mcast_octets: 0, in_bcast_octets: 0, out_bcast_octets: 0, in_csum_errors: 0, in_no_ect_pkts: 0, in_ect1_pkts: 0, in_ect0_pkts: 0, in_ce_pkts: 0 }), Icmp6Stats(Icmp6Stats { num: 7, in_msgs: 0, in_errors: 0, out_msgs: 0, out_errors: 0, csum_errors: 0 }), Token(::), AddrGenMode(0)])]), Other(DefaultNla { kind: 32830, value: [] }), Other(DefaultNla { kind: 32833, value: [] })] })) }
[DEBUG netavark::network::core_utils] Setting sysctl value for /proc/sys/net/ipv6/conf/eth0/autoconf to 0
[DEBUG netavark::network::core_utils] Setting sysctl value for /proc/sys/net/ipv4/conf/eth0/arp_notify to 1
DEBU[0000] Cleaning up rootless network namespace
DEBU[0000] Unmounted container "7713c1bd0dcf26b7359a107e344b7f4f8564fb9c27d3f958a6f51c18707453ce"
DEBU[0000] Network is already cleaned up, skipping...
DEBU[0000] Cleaning up container 7713c1bd0dcf26b7359a107e344b7f4f8564fb9c27d3f958a6f51c18707453ce
DEBU[0000] Network is already cleaned up, skipping...
DEBU[0000] Container 7713c1bd0dcf26b7359a107e344b7f4f8564fb9c27d3f958a6f51c18707453ce storage is already unmounted, skipping...
DEBU[0000] ExitCode msg: "netavark (exit code 1): sysctl error: io error: read-only file system (os error 30)"
Error: netavark (exit code 1): Sysctl error: IO Error: Read-only file system (os error 30)
DEBU[0000] Shutting down engines

[1]: https://www.redhat.com/sysadmin/podman-inside-kubernetes
[2]: https://www.redhat.com/sysadmin/podman-inside-container
[3]: https://devconfcz2023.sched.com/event/1MYld/root-is-less-container-networks-get-in-shape-with-pasta
[4]: https://github.com/containers/podman/blob/main/docs/tutorials/basic_networking.md
[5]: #825 (comment)
[6]:

pub fn apply_sysctl_value(

EDIT: the above was run on Kubernetes 1.29.4 with node kernel 6.8.9 and podman/stable:v5.0.3.

@mvalvekensCET

mvalvekensCET commented Sep 11, 2024

Hi, I'm facing some version of @Omar007's issue as well. With procMount: Unmasked in the container security context, I was able to get a working bridge network in rootless podman inside k8s with minimal friction (podman 4.9.3).

However, as soon as I attempted to expose one of the container ports on localhost, the container failed to come up as netavark tries to set net.ipv4.conf.<interface>.route_localnet=1 and it can't write to /proc in the rootless netns. The issue goes away if I make the podman container a (rootless) privileged one.

My questions:

  • I understand why this happens, but are there workarounds? Or is it in principle possible to wire that kind of port forwarding in a different way? Performance is not really a concern, FWIW (see below).
  • I tried running podman unshare --rootless-netns sysctl -w net.ipv4.conf.default.route_localnet=1 in a privileged init container with the same UID as my main one to "preset" the sysctl value (even tried it while persisting the contents of $XDG_RUNTIME_DIR across both containers), but the sysctl change did not survive in the main container. I don't know enough about namespacing behaviour to judge whether that's unavoidable or not--IIRC this kind of trick does work for making changes to sysctls in the k8s pod's own namespace.
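For reference, the non-fatal handling discussed in this issue, applied to a write like the route_localnet one above, could be sketched roughly as follows. This is not netavark's actual code: `apply_sysctl_best_effort`, `SysctlOutcome`, and `is_read_only` are hypothetical names made up for illustration, and the sketch assumes Linux, where "os error 30" from the log above is EROFS.

```rust
use std::{fs, io, path::Path};

/// Result of a best-effort sysctl write (hypothetical type, not part of netavark).
#[derive(Debug, PartialEq)]
pub enum SysctlOutcome {
    /// The file already held the wanted value, so nothing was written.
    AlreadySet,
    /// The value was written successfully.
    Written,
    /// /proc is read-only; the write was skipped with a warning.
    SkippedReadOnly,
}

/// True if the error is EROFS ("Read-only file system").
/// Assumption: Linux errno numbering, where EROFS is 30.
pub fn is_read_only(err: &io::Error) -> bool {
    err.raw_os_error() == Some(30)
}

/// Try to set the sysctl file at `path` to `value`.
/// A read-only /proc is downgraded to a warning; every other error stays fatal.
pub fn apply_sysctl_best_effort(path: &Path, value: &str) -> io::Result<SysctlOutcome> {
    // Read first: if an outer container manager already set the value,
    // no write is needed and a read-only /proc does not matter.
    let current = fs::read_to_string(path)?;
    if current.trim() == value {
        return Ok(SysctlOutcome::AlreadySet);
    }

    match fs::write(path, value) {
        Ok(()) => Ok(SysctlOutcome::Written),
        Err(e) if is_read_only(&e) => {
            // Networking may be partially broken after this, so say so loudly.
            eprintln!(
                "warning: cannot set {} to {}: {}; continuing",
                path.display(),
                value,
                e
            );
            Ok(SysctlOutcome::SkippedReadOnly)
        }
        Err(e) => Err(e),
    }
}
```

A caller would pass the full path, e.g. `/proc/sys/net/ipv4/conf/<interface>/route_localnet`, and decide per sysctl whether `SkippedReadOnly` is acceptable. For something like ip_forward it probably is not, since external communication depends on it.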

If you think it's meaningful to do so I can make another reproduction attempt with a more recent podman version (the only reason why I tested with 4.9.3 was because it's part of another image in our setup).

Here's some context on what we're even trying to achieve here (X/Y problems and all that): we have a bunch of code using testcontainers for integration tests that currently run on EC2 instances with Docker. We'd like to "lift and shift" all of that into our new CI system that will (likely) only deal with runners in Kubernetes, while following the principle of least privilege as much as possible. Running rootless podman in k8s is one of the avenues we're exploring. Almost every single one of these test setups currently uses Docker's "built-in" port forwarding as its main means of communicating with the containerised services in the test. At the same time, the services also communicate with one another, so just putting them all behind slirp4netns and skipping the bridge network is not really an option => we'd like to have the containers in a bridge network and have ports forwarded to the k8s pod's localhost so the test runner can communicate with them. So far, I haven't found a way to achieve that in rootless mode without privileged: true, hence this post.

@Omar007

Omar007 commented Sep 16, 2024

@mvalvekensCET how are you using procMount: Unmasked? Could you share a working spec for that? Or are you by chance on a k8s version <= 1.29?
The issue I run into when trying to set it to Unmasked is that this also requires the use of user namespaces (spec.hostUsers: false; this is not enforced in k8s <= 1.29, but I have moved to 1.30 since my last post). That in turn makes both /dev/fuse and /dev/net/tun, which are mounted using a device plugin, inaccessible, as things like that are not (yet) compatible with user namespaces in k8s. And of course that will very much prevent things from working ;)

@dennybaa

dennybaa commented Jan 11, 2025

@Omar007 Unfortunately it's difficult to try this on new GKE, since it does not have this feature gate enabled, so spec.hostUsers: false is useless there.
Are you saying that once a pod switches to a user namespace, /dev/fuse and /dev/net/tun become unusable? Can you please explain whether they are really needed for netavark and pasta networking?
