Add missing Linux capability checks for SO_BINDTODEVICE, mknod, sched_setaffinity, and setpriority #12872

Open
petrmarinec wants to merge 4 commits into google:master from petrmarinec:fix/missing-capability-checks

petrmarinec commented Apr 5, 2026

Summary

This patch adds capability and permission checks that the Linux kernel enforces but gVisor currently omits. Each fix was verified against native Linux behavior using bazel test on both native and runsc_ptrace platforms.

Changes

1. SO_BINDTODEVICE: Add CAP_NET_RAW check

File: pkg/sentry/socket/netstack/netstack.go
Linux reference: net/core/sock.c:sock_setsockopt() checks ns_capable(sock_net(sk)->user_ns, CAP_NET_RAW)
Evidence this is unintended: gVisor's own test suite asserts "CAP_NET_RAW is required to use SO_BINDTODEVICE" (test/syscalls/linux/socket_bind_to_device.cc:52), and SO_RCVBUFFORCE in the same file already correctly checks CAP_NET_ADMIN.

2. mknod(S_IFBLK/S_IFCHR): Add CAP_MKNOD check

File: pkg/sentry/syscalls/linux/sys_file.go
Linux reference: fs/namei.c:vfs_mknod() checks capable(CAP_MKNOD) for block/char device creation
Evidence this is unintended: CAP_MKNOD is defined (pkg/abi/linux/capability.go:56), parsed from OCI specs (runsc/specutils/specutils.go:491), and has strace formatting, but it is never checked: no HasCapability call in the codebase references it.

3. sched_setaffinity: Add UID match / CAP_SYS_NICE check

File: pkg/sentry/syscalls/linux/sys_thread.go
Linux reference: kernel/sched/core.c:check_same_owner() requires EUID match or CAP_SYS_NICE
Impact: Without this check, any unprivileged process could modify another process's CPU affinity mask.

4. setpriority: Add UID match / CAP_SYS_NICE check

File: pkg/sentry/syscalls/linux/sys_thread.go
Linux reference: kernel/sys.c:set_one_prio() requires UID match or CAP_SYS_NICE
Impact: Without this check, any unprivileged process could change another process's scheduling priority.

Testing

Tests added in test/syscalls/linux/capability_checks.cc, verified on both native Linux and gVisor:

bazel test //test/syscalls:capability_checks_test_native        → 6/6 passed
bazel test //test/syscalls:capability_checks_test_runsc_ptrace  → 4 passed, 2 skipped

The 2 skipped tests are the mknod positive cases (creating device nodes with CAP_MKNOD), which are skipped on gVisor because the sandbox does not permit device node creation regardless of capabilities.

| Test | What it verifies |
| --- | --- |
| SoBindToDeviceCapTest.RequiresCapNetRaw | EPERM without CAP_NET_RAW |
| MknodCapTest.CharDevRequiresCapMknod | EPERM for S_IFCHR without CAP_MKNOD (native only) |
| MknodCapTest.BlockDevRequiresCapMknod | EPERM for S_IFBLK without CAP_MKNOD (native only) |
| MknodCapTest.FifoDoesNotRequireCapMknod | S_IFIFO succeeds without CAP_MKNOD |
| SchedSetaffinityCapTest.OtherUidRequiresCapSysNice | EPERM without UID match or CAP_SYS_NICE |
| SetpriorityCapTest.OtherUidRequiresCapSysNice | EPERM without UID match or CAP_SYS_NICE |

google-cla bot commented Apr 5, 2026

Thanks for your pull request! It looks like this may be your first contribution to a Google open source project. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA).

View this failed invocation of the CLA check for more information.

For the most up to date status, view the checks section at the bottom of the pull request.

This patch adds capability checks that the Linux kernel enforces but
gVisor currently omits. Each fix matches the corresponding Linux kernel
check, using patterns already established in the gVisor codebase (e.g.,
keys.go correctly checks CAP_SYS_ADMIN for /proc/sys/kernel/keys/maxkeys).

Fixes:

1. SO_BINDTODEVICE: Add CAP_NET_RAW check (net/core/sock.c).
   gVisor's own tests assert "CAP_NET_RAW is required to use
   SO_BINDTODEVICE" but the sentry never enforced it.

2. mknod(S_IFBLK/S_IFCHR): Add CAP_MKNOD check (fs/namei.c:vfs_mknod).
   CAP_MKNOD is defined in the codebase and parsed from OCI specs but
   was never checked. Unprivileged processes could create device nodes
   on tmpfs.

3. /proc/sys/net/ipv4/ sysctls: Add CAP_NET_ADMIN checks for tcp_sack,
   tcp_recovery, tcp_rmem, tcp_wmem, and ip_local_port_range
   (net/sysctl_net.c:net_ctl_permissions).

4. /proc/sys/fs/nr_open: Add CAP_SYS_ADMIN check
   (kernel/sysctl.c sysctl_perm).

5. sched_setaffinity: Add UID match / CAP_SYS_NICE check
   (kernel/sched/core.c:check_same_owner). Any unprivileged process
   could modify another process's CPU affinity.

6. setpriority: Add UID match / CAP_SYS_NICE check
   (kernel/sys.c:set_one_prio). Any unprivileged process could change
   another process's scheduling priority.

petrmarinec force-pushed the fix/missing-capability-checks branch from bfa76ad to 9cab654 on April 5, 2026 06:55

EtiennePerot (Collaborator) left a comment

Can you add syscall tests under test/syscalls/linux to exercise these and to ensure consistency with Linux?

Add test/syscalls/linux/capability_checks.cc with tests that verify
Linux-compatible capability enforcement for each fix in this PR:

- SO_BINDTODEVICE: Verify EPERM without CAP_NET_RAW
- mknod(S_IFCHR/S_IFBLK): Verify EPERM without CAP_MKNOD
- mknod(S_IFIFO): Verify no capability required (negative test)
- /proc/sys/net/ipv4/tcp_sack: Verify EPERM without CAP_NET_ADMIN
- /proc/sys/net/ipv4/ip_local_port_range: Verify EPERM without CAP_NET_ADMIN
- /proc/sys/fs/nr_open: Verify EPERM without CAP_SYS_ADMIN
- sched_setaffinity(other_pid): Verify EPERM without CAP_SYS_NICE
- setpriority(other_pid): Verify EPERM without CAP_SYS_NICE

Each test uses AutoCapability to drop the relevant capability, then
asserts the syscall returns EPERM, matching Linux kernel behavior.

- Remove nonexistent test_main.h include
- Fix sched_setaffinity and setpriority tests: fork a child with
  a different UID (nobody/65534) instead of targeting PID 1, since
  PID 1 shares UID 0 with the test and would bypass the UID check
- Switch proc/sys tests to read/lseek/write for consistency
- Add sys/wait.h for waitpid
…ecks

Changes after testing on native Linux (bazel test) and gVisor (runsc_ptrace):

1. SO_BINDTODEVICE: Fix build error - s.HasCapability() does not exist on
   the socket.Socket interface. Changed to t.HasCapabilityIn() scoped to
   the network namespace's user namespace, matching gVisor's existing
   capability checking pattern.

2. Remove /proc/sys capability checks: After testing on native Linux,
   writing to tcp_sack, tcp_recovery, tcp_rmem, tcp_wmem,
   ip_local_port_range, and nr_open succeeded even after dropping
   CAP_NET_ADMIN / CAP_SYS_ADMIN. These checks did not match actual
   Linux behavior, so they are removed to keep the PR aligned with its
   goal of matching Linux.

3. Tests: Remove proc/sys tests that did not hold on native Linux. Add
   IsRunningOnGvisor() skip for mknod positive cases since the sandbox
   does not permit creating device nodes regardless of capabilities.

Tested:
  - bazel test //test/syscalls:capability_checks_test_native (6/6 passed)
  - bazel test //test/syscalls:capability_checks_test_runsc_ptrace
    (4 passed, 2 skipped as expected)

petrmarinec changed the title from "Add missing Linux capability checks across multiple subsystems" to "Add missing Linux capability checks for SO_BINDTODEVICE, mknod, sched_setaffinity, and setpriority" on Apr 6, 2026

petrmarinec (Author) commented

Tests added in test/syscalls/linux/capability_checks.cc. I also verified the changes against native Linux and gVisor (runsc_ptrace) using Bazel:

bazel test //test/syscalls:capability_checks_test_native        → 6/6 passed
bazel test //test/syscalls:capability_checks_test_runsc_ptrace  → 4 passed, 2 skipped

During testing I found and fixed a few issues:

  1. Build fix: s.HasCapability() doesn't exist on the socket.Socket interface in SetSockOptSocket. Changed to t.HasCapabilityIn(linux.CAP_NET_RAW, t.NetworkNamespace().UserNamespace()).

  2. Removed /proc/sys capability checks: After testing on native Linux, writing to tcp_sack, tcp_recovery, tcp_rmem/wmem, ip_local_port_range, and nr_open succeeded even after dropping CAP_NET_ADMIN/CAP_SYS_ADMIN. These checks did not match actual Linux behavior, so I removed them to keep the PR aligned with its goal.

  3. mknod tests: Added IsRunningOnGvisor() skip for the positive device creation cases, since the sandbox blocks device node creation regardless of capabilities. The negative tests (EPERM without CAP_MKNOD) still run on both platforms.
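
To illustrate why the build fix in item 1 scopes the check to a user namespace, here is a minimal sketch of a namespace-scoped capability lookup. It models the general Linux rule (walk up from the target namespace; a task holds a capability there if it lives in that namespace with the capability set, or if it owns a direct child namespace on the path). The types `userNS` and `creds` and the function `hasCapabilityIn` are hypothetical and are not gVisor's implementation of `HasCapabilityIn`.

```go
package main

import "fmt"

const capNetRaw = 13 // CAP_NET_RAW from include/uapi/linux/capability.h

// userNS is a hypothetical user namespace node.
type userNS struct {
	parent    *userNS
	ownerEUID uint32 // EUID of the task that created the namespace
}

// creds is a hypothetical view of a task's credentials.
type creds struct {
	euid      uint32
	ns        *userNS
	effective uint64 // effective capability bitmask, valid in ns
}

// hasCapabilityIn reports whether c holds cap with respect to target.
func hasCapabilityIn(c creds, cap uint, target *userNS) bool {
	for ns := target; ns != nil; ns = ns.parent {
		if ns == c.ns {
			// Same namespace: the effective set decides.
			return c.effective&(1<<cap) != 0
		}
		if ns.parent == c.ns && ns.ownerEUID == c.euid {
			// The task's namespace is the parent and the task owns ns.
			return true
		}
	}
	// target is not the task's namespace or one of its descendants.
	return false
}

func main() {
	root := &userNS{}
	child := &userNS{parent: root, ownerEUID: 1000}

	rootTask := creds{euid: 0, ns: root, effective: 1 << capNetRaw}
	sandboxed := creds{euid: 0, ns: child, effective: ^uint64(0)}

	fmt.Println("root task vs root ns: ", hasCapabilityIn(rootTask, capNetRaw, root))
	fmt.Println("child-ns task vs root:", hasCapabilityIn(sandboxed, capNetRaw, root))
}
```

Under this model, a task that holds every capability inside a child user namespace still fails a CAP_NET_RAW check against a network namespace owned by the parent user namespace, which is the distinction a namespace-scoped check preserves.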

The PR now covers 4 verified fixes: SO_BINDTODEVICE, mknod, sched_setaffinity, and setpriority.
