The BUG report and stack trace, as collected from the console logs, are included below.
The generic driver was being exercised with an AWS ENA driver through the use of pkt-gen in receive mode, invoked thus:
sudo pkt-gen -i ens7 -f rx -c 1
The crash occurred regardless of whether any packets were received.
Environment
Netmap was built and run on an AWS EC2 Ubuntu 22.04.4 LTS instance, with Linux kernel version 6.8.0-1033-aws.
Fix?
The stack trace shows an invalid memory access (an instruction fetch, per the #PF line) at address 0000000000000001 from within hrtimer_interrupt. The value 0x0000000000000001 is also the value of the CLOCK_MONOTONIC clock ID used by the netmap generic driver. Combined with the hrtimer execution context, this seems to point to the nm_hrtimer_setup macro in LINUX/bsd_glue.h.
Specifically, it looks like that macro erroneously assigns the clock ID (c_) argument as the timer function, where it should instead use the f_ argument.
The big hint for this actually came from the compilation warning below:
In file included from -/LINUX/netmap_linux.c:26:
-/LINUX/netmap_linux.c: In function ‘nm_os_mitigation_init’:
-/LINUX/bsd_glue.h:86:24: warning: assignment to ‘enum hrtimer_restart (*)(struct hrtimer *)’ from ‘int’ makes pointer from integer without a cast [-Wint-conversion]
86 | (t_)->function = (c_); \
| ^
-/LINUX/netmap_linux.c:513:9: note: in expansion of macro ‘nm_hrtimer_setup’
513 | nm_hrtimer_setup(&mit->mit_timer, &generic_timer_handler,
| ^~~~~~~~~~~~~~~~
-/LINUX/netmap_linux.c: At top level:
-/LINUX/netmap_linux.c:483:1: warning: ‘generic_timer_handler’
After the speculative patch below was applied locally, the crash no longer occurred.
diff --git a/LINUX/bsd_glue.h b/LINUX/bsd_glue.h
index 9c42bdcb..5aefb98b 100644
--- a/LINUX/bsd_glue.h
+++ b/LINUX/bsd_glue.h
@@ -83,7 +83,7 @@
#else
#define nm_hrtimer_setup(t_, f_, c_, m_) do { \
hrtimer_init(t_, c_, m_); \
- (t_)->function = (c_); \
+ (t_)->function = (f_); \
} while (0)
#endif
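The argument mix-up can be illustrated outside the kernel with a minimal userspace sketch. Everything prefixed with mock_ or MOCK_ below is a hypothetical stand-in invented for illustration, not the real kernel or netmap code, and struct hrtimer is reduced to the one field that matters here. The buggy variant needs an explicit cast to compile cleanly, which is exactly the conversion the compiler warned about.

```c
#include <stddef.h>
#include <stdint.h>

/* Userspace stand-ins for illustration only; NOT the real kernel definitions. */
enum hrtimer_restart { HRTIMER_NORESTART, HRTIMER_RESTART };

struct hrtimer {
    enum hrtimer_restart (*function)(struct hrtimer *);
};

typedef enum hrtimer_restart (*mock_timer_fn)(struct hrtimer *);

#define MOCK_CLOCK_MONOTONIC 1 /* same numeric value as CLOCK_MONOTONIC on Linux */
#define MOCK_MODE 0            /* hypothetical placeholder for the timer mode */

/* Hypothetical stand-in for hrtimer_init(): only clears the callback. */
static void mock_hrtimer_init(struct hrtimer *t, int clock_id, int mode)
{
    (void)clock_id;
    (void)mode;
    t->function = NULL;
}

/* Shape of the macro before the patch: the clock ID lands in ->function. */
#define mock_setup_buggy(t_, f_, c_, m_) do {            \
    mock_hrtimer_init(t_, c_, m_);                       \
    (t_)->function = (mock_timer_fn)(uintptr_t)(c_);     \
} while (0)

/* Shape of the macro after the patch: the handler lands in ->function. */
#define mock_setup_fixed(t_, f_, c_, m_) do {            \
    mock_hrtimer_init(t_, c_, m_);                       \
    (t_)->function = (f_);                               \
} while (0)

static enum hrtimer_restart mock_timer_handler(struct hrtimer *t)
{
    (void)t;
    return HRTIMER_NORESTART;
}

/* Address the timer core would jump to after the buggy setup.
 * The pointer is only inspected here, never called. */
static uintptr_t buggy_callback_address(void)
{
    struct hrtimer t;
    mock_setup_buggy(&t, &mock_timer_handler, MOCK_CLOCK_MONOTONIC, MOCK_MODE);
    return (uintptr_t)t.function;
}

/* Returns 1 if the fixed setup installs the handler and it can be invoked. */
static int fixed_callback_is_handler(void)
{
    struct hrtimer t;
    mock_setup_fixed(&t, &mock_timer_handler, MOCK_CLOCK_MONOTONIC, MOCK_MODE);
    return t.function == &mock_timer_handler &&
           t.function(&t) == HRTIMER_NORESTART;
}
```

On common Linux toolchains, buggy_callback_address() should yield 1, which lines up with the RIP: 0010:0x1 value in the stack trace, while fixed_callback_is_handler() should report that the intended handler was installed.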
Stack Trace
Stack trace, from console logs:
[ 918.086771] BUG: kernel NULL pointer dereference, address: 0000000000000001
[ 918.087640] #PF: supervisor instruction fetch in kernel mode
[ 918.088338] #PF: error_code(0x0010) - not-present page
[ 918.088967] PGD 0 P4D 0
[ 918.089310] Oops: 0010 [#1] SMP NOPTI
[ 918.089782] CPU: 1 PID: 0 Comm: swapper/1 Tainted: G OE 6.8.0-1033-aws #35~22.04.1-Ubuntu
[ 918.090899] Hardware name: Amazon EC2 m5a.large/, BIOS 1.0 10/16/2017
[ 918.091665] RIP: 0010:0x1
[ 918.092033] Code: Unable to access opcode bytes at 0xffffffffffffffd7.
[ 918.092808] RSP: 0018:ffffa697000f4f08 EFLAGS: 00010046
[ 918.093446] RAX: 0000000000000000 RBX: ffff893e91cb0e40 RCX: 0000000000000000
[ 918.094298] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff893e91cb0e40
[ 918.095153] RBP: ffffa697000f4f68 R08: 0000000000000000 R09: 0000000000000000
[ 918.096010] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000001
[ 918.096861] R13: ffff893eedf247c0 R14: ffff893eedf247c0 R15: ffff893eedf24800
[ 918.097718] FS: 0000000000000000(0000) GS:ffff893eedf00000(0000) knlGS:0000000000000000
[ 918.098688] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 918.099383] CR2: ffffffffffffffd7 CR3: 00000001924ec000 CR4: 00000000003506f0
[ 918.100254] Call Trace:
[ 918.100588] <IRQ>
[ 918.100872] ? show_regs+0x6d/0x80
[ 918.101307] ? __die+0x24/0x80
[ 918.101716] ? page_fault_oops+0x99/0x1b0
[ 918.102217] ? do_user_addr_fault+0x2ee/0x670
[ 918.102760] ? exc_page_fault+0x83/0x190
[ 918.103254] ? asm_exc_page_fault+0x27/0x30
[ 918.103783] ? __hrtimer_run_queues+0x112/0x250
[ 918.104349] ? srso_return_thunk+0x5/0x5f
[ 918.104854] hrtimer_interrupt+0xf6/0x250
[ 918.105361] __sysvec_apic_timer_interrupt+0x4e/0xf0
[ 918.105973] sysvec_apic_timer_interrupt+0x8d/0xd0
[ 918.106567] </IRQ>
[ 918.106858] <TASK>
[ 918.107696] asm_sysvec_apic_timer_interrupt+0x1b/0x20
[ 918.108818] RIP: 0010:pv_native_safe_halt+0xb/0x10
[ 918.109895] Code: 22 d7 31 ff e9 b6 28 01 00 66 0f 1f 44 00 00 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 eb 07 0f 00 2d 59 e3 3e 00 fb f4 <e9> 90 28 01 00 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 83
[ 918.113509] RSP: 0018:ffffa697000b7db0 EFLAGS: 00000246
[ 918.114646] RAX: 0000000000004000 RBX: ffff893dc0e33064 RCX: 0000000000000000
[ 918.116006] RDX: 0000000000000001 RSI: ffff893dc0e33000 RDI: 0000000000000001
[ 918.117361] RBP: ffffa697000b7db8 R08: 0000000000000000 R09: 0000000000000000
[ 918.118701] R10: 0000000000000000 R11: 0000000000000000 R12: ffff893dc0e33064
[ 918.120043] R13: 0000000000000001 R14: ffffffff91af9240 R15: ffff893eedf00000
[ 918.121387] ? acpi_safe_halt+0x19/0x60
[ 918.122357] acpi_idle_do_entry+0x40/0x80
[ 918.123334] acpi_idle_enter+0xb6/0x180
[ 918.124302] cpuidle_enter_state+0x91/0x6f0
[ 918.125293] ? srso_return_thunk+0x5/0x5f
[ 918.126264] ? finish_task_switch.isra.0+0x89/0x2f0
[ 918.127332] cpuidle_enter+0x2e/0x50
[ 918.128277] call_cpuidle+0x23/0x60
[ 918.129179] cpuidle_idle_call+0x10f/0x150
[ 918.130136] do_idle+0x87/0xf0
[ 918.130966] cpu_startup_entry+0x2a/0x30
[ 918.131874] start_secondary+0x129/0x160
[ 918.132798] secondary_startup_64_no_verify+0x184/0x18b
[ 918.133862] </TASK>
[ 918.134588] Modules linked in: netmap(OE) tls binfmt_misc nls_iso8859_1 ppdev crct10dif_pclmul crc32_pclmul polyval_clmulni polyval_generic ghash_clmulni_intel sha256_ssse3 sha1_ssse3 aesni_intel crypto_simd parport_pc cryptd input_leds parport psmouse serio_raw ena dm_multipath scsi_dh_rdac scsi_dh_emc scsi_dh_alua sch_fq_codel nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 libcrc32c vfio_pci vfio_pci_core irqbypass vfio_iommu_type1 vfio iommufd efi_pstore ip_tables x_tables autofs4 [last unloaded: netmap(OE)]
[ 918.142596] CR2: 0000000000000001
[ 918.143535] ---[ end trace 0000000000000000 ]---
[ 918.274181] RIP: 0010:0x1
[ 918.275426] Code: Unable to access opcode bytes at 0xffffffffffffffd7.
[ 918.276786] RSP: 0018:ffffa697000f4f08 EFLAGS: 00010046
[ 918.277978] RAX: 0000000000000000 RBX: ffff893e91cb0e40 RCX: 0000000000000000
[ 918.279383] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff893e91cb0e40
[ 918.280802] RBP: ffffa697000f4f68 R08: 0000000000000000 R09: 0000000000000000
[ 918.282209] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000001
[ 918.283590] R13: ffff893eedf247c0 R14: ffff893eedf247c0 R15: ffff893eedf24800
[ 918.284963] FS: 0000000000000000(0000) GS:ffff893eedf00000(0000) knlGS:0000000000000000
[ 918.286432] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 918.287663] CR2: ffffffffffffffd7 CR3: 00000001924ec000 CR4: 00000000003506f0
[ 918.289064] Kernel panic - not syncing: Fatal exception in interrupt
[ 918.290690] Kernel Offset: 0xe400000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)