Add missing Linux capability checks for SO_BINDTODEVICE, mknod, sched_setaffinity, and setpriority #12872
petrmarinec wants to merge 4 commits into google:master
Thanks for your pull request! It looks like this may be your first contribution to a Google open source project. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA). View this failed invocation of the CLA check for more information. For the most up to date status, view the checks section at the bottom of the pull request.
This patch adds capability checks that the Linux kernel enforces but gVisor currently omits. Each fix matches the corresponding Linux kernel check, using patterns already established in the gVisor codebase (e.g., keys.go correctly checks CAP_SYS_ADMIN for /proc/sys/kernel/keys/maxkeys).

Fixes:
1. SO_BINDTODEVICE: Add CAP_NET_RAW check (net/core/sock.c). gVisor's own
   tests assert "CAP_NET_RAW is required to use SO_BINDTODEVICE" but the
   sentry never enforced it.
2. mknod(S_IFBLK/S_IFCHR): Add CAP_MKNOD check (fs/namei.c:vfs_mknod).
   CAP_MKNOD is defined in the codebase and parsed from OCI specs but was
   never checked. Unprivileged processes could create device nodes on tmpfs.
3. /proc/sys/net/ipv4/ sysctls: Add CAP_NET_ADMIN checks for tcp_sack,
   tcp_recovery, tcp_rmem, tcp_wmem, and ip_local_port_range
   (net/sysctl_net.c:net_ctl_permissions).
4. /proc/sys/fs/nr_open: Add CAP_SYS_ADMIN check (kernel/sysctl.c
   sysctl_perm).
5. sched_setaffinity: Add UID match / CAP_SYS_NICE check
   (kernel/sched/core.c:check_same_owner). Any unprivileged process could
   modify another process's CPU affinity.
6. setpriority: Add UID match / CAP_SYS_NICE check
   (kernel/sys.c:set_one_prio). Any unprivileged process could change
   another process's scheduling priority.
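The mknod fix described above can be illustrated with a small self-contained model. This is a sketch, not the gVisor code: `checkMknod`, `errPerm`, and the boolean `hasCapMknod` are hypothetical names introduced here, and the model assumes only what the description states, namely that character and block device nodes are gated on CAP_MKNOD while other file types such as FIFOs are not.

```go
package main

import (
	"errors"
	"fmt"
)

// File type bits of the mknod(2) mode argument (values from <sys/stat.h>).
const (
	sIFMT  = 0o170000 // mask selecting the file type bits
	sIFIFO = 0o010000 // FIFO
	sIFCHR = 0o020000 // character device
	sIFBLK = 0o060000 // block device
)

// errPerm stands in for EPERM in this model (hypothetical name).
var errPerm = errors.New("operation not permitted")

// checkMknod is a hypothetical helper: it rejects creation of character and
// block device nodes unless the caller holds CAP_MKNOD, and lets every other
// file type through without a capability check.
func checkMknod(mode uint32, hasCapMknod bool) error {
	switch mode & sIFMT {
	case sIFCHR, sIFBLK:
		if !hasCapMknod {
			return errPerm
		}
	}
	return nil
}

func main() {
	fmt.Println("chr without cap:", checkMknod(sIFCHR|0o644, false))
	fmt.Println("blk without cap:", checkMknod(sIFBLK|0o600, false))
	fmt.Println("fifo without cap:", checkMknod(sIFIFO|0o644, false))
	fmt.Println("chr with cap:", checkMknod(sIFCHR|0o644, true))
}
```

In the sentry the boolean would presumably come from the calling task's credentials rather than a parameter; it is passed explicitly here only to keep the example standalone.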
bfa76ad to 9cab654
EtiennePerot left a comment
Can you add syscall tests under test/syscalls/linux to exercise these and to ensure consistency with Linux?
Add test/syscalls/linux/capability_checks.cc with tests that verify Linux-compatible capability enforcement for each fix in this PR:
- SO_BINDTODEVICE: Verify EPERM without CAP_NET_RAW
- mknod(S_IFCHR/S_IFBLK): Verify EPERM without CAP_MKNOD
- mknod(S_IFIFO): Verify no capability required (negative test)
- /proc/sys/net/ipv4/tcp_sack: Verify EPERM without CAP_NET_ADMIN
- /proc/sys/net/ipv4/ip_local_port_range: Verify EPERM without CAP_NET_ADMIN
- /proc/sys/fs/nr_open: Verify EPERM without CAP_SYS_ADMIN
- sched_setaffinity(other_pid): Verify EPERM without CAP_SYS_NICE
- setpriority(other_pid): Verify EPERM without CAP_SYS_NICE

Each test uses AutoCapability to drop the relevant capability, then asserts the syscall returns EPERM, matching Linux kernel behavior.
- Remove nonexistent test_main.h include
- Fix sched_setaffinity and setpriority tests: fork a child with a different UID (nobody/65534) instead of targeting PID 1, since PID 1 shares UID 0 with the test and would bypass the UID check
- Switch proc/sys tests to read/lseek/write for consistency
- Add sys/wait.h for waitpid
…ecks
Changes after testing on native Linux (bazel test) and gVisor (runsc_ptrace):
1. SO_BINDTODEVICE: Fix build error - s.HasCapability() does not exist on
the socket.Socket interface. Changed to t.HasCapabilityIn() scoped to
the network namespace's user namespace, matching gVisor's existing
capability checking pattern.
2. Remove /proc/sys capability checks: After testing on native Linux,
writing to tcp_sack, tcp_recovery, tcp_rmem, tcp_wmem,
ip_local_port_range, and nr_open succeeded even after dropping
CAP_NET_ADMIN / CAP_SYS_ADMIN. These checks did not match actual
Linux behavior, so they are removed to keep the PR aligned with its
goal of matching Linux.
3. Tests: Remove proc/sys tests that did not hold on native Linux. Add
IsRunningOnGvisor() skip for mknod positive cases since the sandbox
does not permit creating device nodes regardless of capabilities.
Tested:
- bazel test //test/syscalls:capability_checks_test_native (6/6 passed)
- bazel test //test/syscalls:capability_checks_test_runsc_ptrace
(4 passed, 2 skipped as expected)
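
The namespace-scoped check mentioned in item 1 can be sketched as a standalone model. Everything here is an assumption made for illustration: `userNamespace`, `credentials`, and this `hasCapabilityIn` function are hypothetical and are not gVisor's types or its `HasCapabilityIn` method. The rule modeled is a simplified reading of how Linux resolves a capability against a target user namespace: the caller needs the capability in its own namespace when that namespace is the target or one of its ancestors, and the owner of a namespace, acting from its parent, is treated as holding every capability in it.

```go
package main

import "fmt"

type capability string

const capNetRaw capability = "CAP_NET_RAW"

// userNamespace is a hypothetical model of a user namespace.
type userNamespace struct {
	parent    *userNamespace // nil for the root namespace
	ownerEUID uint32         // EUID of the creator, as seen from the parent
}

// credentials is a hypothetical model of a task's credentials.
type credentials struct {
	userNS    *userNamespace
	euid      uint32
	effective map[capability]bool
}

// hasCapabilityIn reports whether c holds cp with respect to target.
// It walks from target towards the root namespace.
func hasCapabilityIn(c credentials, cp capability, target *userNamespace) bool {
	for ns := target; ns != nil; ns = ns.parent {
		if ns == c.userNS {
			// The caller lives in this namespace: consult its effective set.
			return c.effective[cp]
		}
		if ns.parent == c.userNS && ns.ownerEUID == c.euid {
			// The caller created this namespace from its parent.
			return true
		}
	}
	// The caller's namespace is not the target or one of its ancestors.
	return false
}

func main() {
	root := &userNamespace{}
	child := &userNamespace{parent: root, ownerEUID: 1000}

	privileged := credentials{userNS: root, effective: map[capability]bool{capNetRaw: true}}
	sandboxed := credentials{userNS: child, effective: map[capability]bool{capNetRaw: true}}

	fmt.Println(hasCapabilityIn(privileged, capNetRaw, child)) // capability held in an ancestor
	fmt.Println(hasCapabilityIn(sandboxed, capNetRaw, root))   // capability held only in a descendant
}
```

The point of scoping the check is visible in the second call: a capability held inside a child user namespace says nothing about a network namespace owned by the root user namespace.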
Tests added. During testing I found and fixed a few issues. The PR now covers 4 verified fixes.
Summary
This patch adds capability and permission checks that the Linux kernel enforces but gVisor currently omits. Each fix was verified against native Linux behavior using bazel test on both native and runsc_ptrace platforms.
Changes
1. SO_BINDTODEVICE: Add CAP_NET_RAW check
   File: pkg/sentry/socket/netstack/netstack.go
   Linux reference: net/core/sock.c:sock_setsockopt() checks ns_capable(sock_net(sk)->user_ns, CAP_NET_RAW)
   Evidence this is unintended: gVisor's own test suite asserts "CAP_NET_RAW is required to use SO_BINDTODEVICE" (test/syscalls/linux/socket_bind_to_device.cc:52), and SO_RCVBUFFORCE in the same file already correctly checks CAP_NET_ADMIN.
2. mknod(S_IFBLK/S_IFCHR): Add CAP_MKNOD check
   File: pkg/sentry/syscalls/linux/sys_file.go
   Linux reference: fs/namei.c:vfs_mknod() checks capable(CAP_MKNOD) for block/char device creation
   Evidence this is unintended: CAP_MKNOD is defined (pkg/abi/linux/capability.go:56), parsed from OCI specs (runsc/specutils/specutils.go:491), and has strace formatting, but is never checked anywhere. Zero HasCapability calls for it exist in the codebase.
3. sched_setaffinity: Add UID match / CAP_SYS_NICE check
   File: pkg/sentry/syscalls/linux/sys_thread.go
   Linux reference: kernel/sched/core.c:check_same_owner() requires EUID match or CAP_SYS_NICE
   Impact: Without this check, any unprivileged process could modify another process's CPU affinity mask.
4. setpriority: Add UID match / CAP_SYS_NICE check
   File: pkg/sentry/syscalls/linux/sys_thread.go
   Linux reference: kernel/sys.c:set_one_prio() requires UID match or CAP_SYS_NICE
   Impact: Without this check, any unprivileged process could change another process's scheduling priority.
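
Changes 3 and 4 share one rule, which can be sketched as follows. This is a minimal model, not the code in sys_thread.go: `taskCreds` and `mayAdjustScheduling` are hypothetical names, and the exact UID comparison (caller's EUID against the target's EUID or real UID) reflects an assumed reading of the referenced Linux functions rather than something stated in this PR.

```go
package main

import "fmt"

// taskCreds is a hypothetical, reduced view of a task's credentials.
type taskCreds struct {
	ruid       uint32 // real UID
	euid       uint32 // effective UID
	capSysNice bool   // whether CAP_SYS_NICE is in the effective set
}

// mayAdjustScheduling reports whether caller may change the CPU affinity or
// priority of target: either the UIDs match, or the caller has CAP_SYS_NICE.
func mayAdjustScheduling(caller, target taskCreds) bool {
	if caller.euid == target.euid || caller.euid == target.ruid {
		return true
	}
	return caller.capSysNice
}

func main() {
	alice := taskCreds{ruid: 1000, euid: 1000}
	nobody := taskCreds{ruid: 65534, euid: 65534}
	nicer := taskCreds{ruid: 1000, euid: 1000, capSysNice: true}

	fmt.Println(mayAdjustScheduling(alice, alice))  // same owner
	fmt.Println(mayAdjustScheduling(alice, nobody)) // different owner, no capability
	fmt.Println(mayAdjustScheduling(nicer, nobody)) // different owner, CAP_SYS_NICE
}
```

The model also shows why a test must target a task with a different UID: when caller and target share a UID, the capability is never consulted, so dropping CAP_SYS_NICE alone would not produce EPERM.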
Testing
Tests added in test/syscalls/linux/capability_checks.cc, verified on both native Linux and gVisor:
- SoBindToDeviceCapTest.RequiresCapNetRaw: EPERM without CAP_NET_RAW
- MknodCapTest.CharDevRequiresCapMknod: EPERM for S_IFCHR without CAP_MKNOD (native only)
- MknodCapTest.BlockDevRequiresCapMknod: EPERM for S_IFBLK without CAP_MKNOD (native only)
- MknodCapTest.FifoDoesNotRequireCapMknod: S_IFIFO succeeds without CAP_MKNOD
- SchedSetaffinityCapTest.OtherUidRequiresCapSysNice: EPERM without UID match or CAP_SYS_NICE
- SetpriorityCapTest.OtherUidRequiresCapSysNice: EPERM without UID match or CAP_SYS_NICE

The 2 skipped tests are the mknod positive cases (creating device nodes with CAP_MKNOD), which are skipped on gVisor because the sandbox does not permit device node creation regardless of capabilities.