94 changes: 94 additions & 0 deletions BACKLOG.md
@@ -0,0 +1,94 @@
# Backlog

## 1. Kernel CPU accounting under-reports vs hardware PMU counters

**Priority**: Low (understanding only — use `perf stat` as ground truth)
**Status**: Root cause narrowed, further experiments possible

### Problem

The Linux kernel's CPU time accounting (`/proc/pid/stat`, `schedstat`) consistently under-reports
CPU utilization compared to `perf stat` (hardware PMU counters) for the Netty custom scheduler workload.

**Measured on NETTY_SCHEDULER with 4 carrier threads pinned to 4 physical cores (Ryzen 9 7950X):**

| Source | CPUs utilized | Notes |
|--------|--------------|-------|
| `perf stat` (PMU task-clock) | **3.96** | Hardware counter ground truth |
| `/proc/pid/stat` (utime+stime) | **3.19** | Kernel accounting |
| `schedstat` (sum of all thread run_ns) | **3.19** | CFS-level accounting |
| `pidstat` process-level | **2.84** | Even lower (pidstat's own sampling) |
| `pidstat` per-thread carrier sum | **2.72** | 4 x ~68% |
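
The kernel-side numbers in this table can be reproduced directly from `/proc/<pid>/stat`, the same counters pidstat reads. A minimal sketch (the class name, the common 100 Hz `CLK_TCK`, and the 2 s window are assumptions, not part of the harness):

```java
import java.nio.file.Files;
import java.nio.file.Path;

class ProcStatCpus {
    /** utime + stime for a pid, in clock ticks, parsed from /proc/<pid>/stat. */
    static long cpuTicks(long pid) throws Exception {
        String stat = Files.readString(Path.of("/proc/" + pid + "/stat"));
        // Field 2 (comm) may contain spaces, so split after the last ')'.
        String rest = stat.substring(stat.lastIndexOf(')') + 2);
        String[] f = rest.trim().split("\\s+");
        // utime and stime are fields 14 and 15 of the full line,
        // i.e. indices 11 and 12 once pid, comm and the ')' are stripped.
        return Long.parseLong(f[11]) + Long.parseLong(f[12]);
    }

    public static void main(String[] args) throws Exception {
        long pid = args.length > 0 ? Long.parseLong(args[0])
                                   : ProcessHandle.current().pid();
        long clkTck = 100;      // assumption: `getconf CLK_TCK` is 100 on most distros
        long windowMs = 2_000;  // sampling window
        long t0 = cpuTicks(pid);
        Thread.sleep(windowMs);
        long t1 = cpuTicks(pid);
        System.out.printf("%.2f CPUs utilized%n",
                (t1 - t0) / (double) clkTck / (windowMs / 1000.0));
    }
}
```

Comparing this figure against `perf stat` over the same window should expose the same gap described above.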

### Key findings

- **pidstat is NOT lying** — it faithfully reports what the kernel provides
- `/proc/pid/stat` and `schedstat` agree perfectly (both at 3.19 CPUs)
- The kernel itself under-counts by ~0.77 CPUs (19%) vs PMU hardware counters
- This is NOT caused by `CONFIG_TICK_CPU_ACCOUNTING` — the kernel uses
`CONFIG_VIRT_CPU_ACCOUNTING_GEN=y` (full dyntick, precise at context switch boundaries)
- All 31 kernel-visible threads were accounted for — no "hidden threads" from
`-Djdk.trackAllThreads=false` (VTs are invisible to the kernel by design, their CPU time
is charged to carrier threads)

### Hypotheses for the 0.77 CPU gap

1. **Kernel scheduling overhead**: `__schedule()` / `finish_task_switch()` runs during
thread transitions. Some of this CPU time may be attributed to idle/swapper rather
than the thread being switched in/out.

2. **Interrupt handling**: hardware interrupts (NIC, timer) steal cycles from the process.
   perf's `task-clock` keeps counting while an interrupt handler borrows a core from the
   running thread, whereas the kernel charges that time to the irq/softirq buckets of
   `/proc/stat` rather than to the thread.

3. **`task-clock` semantics**: `perf stat`'s `task-clock` sums, across all threads, the time
   each thread spends scheduled on a CPU. With 4 busy threads pinned to 4 cores, task-clock
   closely approaches 4.0 * elapsed, and it includes interrupt-handling time on those cores
   that `/proc/stat` charges elsewhere.

4. **Carrier thread park/unpark transitions**: even with VIRT_CPU_ACCOUNTING_GEN, the
accounting happens at `schedule()` boundaries. CPU cycles consumed during the entry/exit
paths of `LockSupport.park()` (before the actual `schedule()` call and after the wakeup)
may be partially lost.

### Further experiments (if desired)

1. **Compare `perf stat -e task-clock` vs `perf stat -e cpu-clock`**: `task-clock` counts
   per-thread scheduled time; `cpu-clock` counts wall-clock time. If they differ, interrupt
   overhead is the likely cause.

2. **Run with `nohz_full=4-7` (isolated CPUs)**: removes timer tick interrupts from server
cores. If the gap shrinks, interrupt overhead is the cause.

3. **Spin-wait instead of park**: replace `LockSupport.park()` with `Thread.onSpinWait()`
in `FifoEventLoopScheduler`. If gap shrinks, park/unpark accounting is lossy.

4. **Check `/proc/interrupts`** delta during benchmark: quantify how many interrupts hit
cores 4-7 and estimate their CPU cost.

5. **`perf stat` per-thread (`-t TID`)** for each carrier: compare PMU task-clock per
carrier vs schedstat per carrier to see if the gap is evenly distributed.
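
Experiment 4 needs no extra tooling: snapshot `/proc/interrupts`, run the benchmark, snapshot again, and diff. A hedged sketch (the class name and the 4-7 CPU list are illustrative):

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;

class IrqDelta {
    /** Sum interrupt counts for the given CPUs (CPU0 is the column after the IRQ label). */
    static long totalInterrupts(int... cpus) throws Exception {
        List<String> lines = Files.readAllLines(Path.of("/proc/interrupts"));
        long total = 0;
        for (String line : lines.subList(1, lines.size())) {  // skip the CPU header row
            String[] cols = line.trim().split("\\s+");
            for (int cpu : cpus) {
                int idx = cpu + 1;  // cols[0] is the "NN:" label
                // Guard: summary rows (ERR, MIS) and trailing descriptions are not per-CPU counts.
                if (idx < cols.length && cols[idx].matches("\\d+")) {
                    total += Long.parseLong(cols[idx]);
                }
            }
        }
        return total;
    }

    public static void main(String[] args) throws Exception {
        long before = totalInterrupts(4, 5, 6, 7);
        Thread.sleep(1_000);  // stand-in for the benchmark run
        long after = totalInterrupts(4, 5, 6, 7);
        System.out.println("interrupts delivered to CPUs 4-7: " + (after - before));
    }
}
```

Multiplying the delta by a rough per-interrupt cost (a few microseconds, hardware-dependent) gives an order-of-magnitude estimate to compare against the 0.77-CPU gap.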

### Conclusion

For benchmarking purposes, **always use `perf stat` as the ground truth** for CPU utilization.
pidstat is still useful for relative thread balance analysis and for monitoring non-server
components (mock server, load generator) where the gap is less significant.

---

## 2. Add spin-wait phase before carrier thread parking

**Priority**: Medium (performance optimization)
**Status**: Not started

In `FifoEventLoopScheduler.virtualThreadSchedulerLoop()`, the carrier thread parks immediately
when the queue drains. Adding a brief spin-wait phase (e.g., 100-1000 iterations of
`Thread.onSpinWait()`) before calling `LockSupport.park()` could:

- Reduce wake-up latency for incoming work (avoid kernel schedule/deschedule)
- Reduce context switch count (currently ~20/sec, could go to near-zero)
- Trade-off: slightly higher idle CPU consumption

### Key file
- `core/src/main/java/io/netty/loom/FifoEventLoopScheduler.java` line ~199
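
A spin-then-park idle loop along these lines could look as follows (the names, queue type, and iteration count are illustrative, not the scheduler's actual code):

```java
import java.util.Queue;
import java.util.concurrent.locks.LockSupport;

final class SpinThenPark {
    private static final int SPIN_ITERATIONS = 1_000;  // tuning knob from the range above

    /** Returns true if work arrived during the spin phase, i.e. no park was needed. */
    static boolean awaitWork(Queue<Runnable> queue) {
        for (int i = 0; i < SPIN_ITERATIONS; i++) {
            if (!queue.isEmpty()) {
                return true;          // work arrived while spinning: no context switch paid
            }
            Thread.onSpinWait();      // CPU pause hint; keeps the carrier on-core
        }
        LockSupport.park();           // queue stayed empty: fall back to kernel blocking
        return !queue.isEmpty();
    }
}
```

A production version would also need a bounded park (e.g. `parkNanos`) and an unpark on task submission; the sketch only shows the spin-phase trade-off.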
212 changes: 131 additions & 81 deletions benchmark-runner/README.md
@@ -49,46 +49,44 @@ The `run-benchmark.sh` script will automatically build the JAR if missing.

## Configuration

Configuration is via CLI flags (preferred) or environment variables (fallback); when both are set, the CLI flag wins.

Run `./run-benchmark.sh --help` for the full list.

### Server
| CLI flag | Env var | Default | Description |
|----------|---------|---------|-------------|
| `--mode` | `SERVER_MODE` | NON_VIRTUAL_NETTY | Mode: NON_VIRTUAL_NETTY, REACTIVE, VIRTUAL_NETTY, NETTY_SCHEDULER |
| `--threads` | `SERVER_THREADS` | 2 | Number of event loop threads |
| `--mockless` | `SERVER_MOCKLESS` | false | Skip mock server; do Jackson work inline |
| `--io` | `SERVER_IO` | epoll | I/O type: epoll, nio, io_uring |
| `--poller-mode` | `SERVER_POLLER_MODE` | | jdk.pollerMode: 1, 2, or 3 |
| `--fj-parallelism` | `SERVER_FJ_PARALLELISM` | | ForkJoinPool parallelism |
| `--server-cpuset` | `SERVER_CPUSET` | 2,3 | CPU pinning |
| `--jvm-args` | `SERVER_JVM_ARGS` | | Additional JVM arguments |

### Mock Server
| CLI flag | Env var | Default | Description |
|----------|---------|---------|-------------|
| `--mock-port` | `MOCK_PORT` | 8080 | Mock server port |
| `--mock-think-time` | `MOCK_THINK_TIME_MS` | 1 | Simulated processing delay (ms) |
| `--mock-threads` | `MOCK_THREADS` | 1 | Number of Netty threads |
| `--mock-cpuset` | `MOCK_CPUSET` | 4,5 | CPU pinning |

### Load Generator
| CLI flag | Env var | Default | Description |
|----------|---------|---------|-------------|
| `--connections` | `LOAD_GEN_CONNECTIONS` | 100 | Number of connections |
| `--load-threads` | `LOAD_GEN_THREADS` | 2 | Number of threads |
| `--duration` | `LOAD_GEN_DURATION` | 30s | Test duration |
| `--rate` | `LOAD_GEN_RATE` | | Target rate for wrk2 (omit for max throughput) |
| `--load-cpuset` | `LOAD_GEN_CPUSET` | 0,1 | CPU pinning |

### Timing
| CLI flag | Env var | Default | Description |
|----------|---------|---------|-------------|
| `--warmup` | `WARMUP_DURATION` | 10s | Warmup duration |
| `--total-duration` | `TOTAL_DURATION` | 30s | Total test duration (steady-state >= 20s) |

### Profiling
| Variable | Default | Description |
@@ -153,73 +151,126 @@ perf stat uses `PROFILING_DELAY_SECONDS` and `PROFILING_DURATION_SECONDS`.

## Example Runs

### Choosing CPU pinning with `lscpu -e`

Good benchmarking requires NUMA-aware CPU pinning. Start by inspecting your topology:

```bash
$ lscpu -e
CPU NODE SOCKET CORE L1d:L1i:L2:L3 ONLINE
0 0 0 0 0:0:0:0 yes # NUMA 0, physical core 0
1 0 0 1 1:1:1:0 yes # NUMA 0, physical core 1
...
8 1 0 8 8:8:8:1 yes # NUMA 1, physical core 8
...
16 0 0 0 0:0:0:0 yes # NUMA 0, SMT sibling of core 0
17 0 0 1 1:1:1:0 yes # NUMA 0, SMT sibling of core 1
```

Key rules:
- **Keep all benchmark components on the same NUMA node** to avoid cross-node memory latency
- **Use physical cores only** (avoid SMT siblings) for more stable results
- **Isolate noisy processes** (IDEs, browsers) on the other NUMA node

Example layout for a 16-core/2-NUMA system with 4 server threads:

| Component | CPUs | Rationale |
|-----------|------|-----------|
| Load generator | 0-1 | 2 physical cores, enough to saturate |
| Mock server | 2-3 | 2 physical cores for backend simulation |
| Handoff server | 4-7 | 4 physical cores, one per event loop thread |
| Other processes | 8-15 | Isolated on NUMA node 1 |

### NETTY_SCHEDULER with 4 threads

```bash
./run-benchmark.sh --mode NETTY_SCHEDULER --threads 4 --io nio \
--server-cpuset "4-7" --mock-cpuset "2-3" --load-cpuset "0-1" \
--jvm-args "-Xms8g -Xmx8g" \
--connections 10000 --load-threads 4 \
--mock-think-time 30 --mock-threads 4 \
--perf-stat
```

### Analyzing bottlenecks with perf stat

Use `--perf-stat` to get reliable hardware-level metrics. The `perf-stat.txt` output is the
ground truth for CPU utilization — pidstat per-thread numbers can be misleading with virtual threads.

```
Performance counter stats for process id '95868':
39,741,757,754 task-clock # 3.970 CPUs utilized
806 context-switches # 20.281 /sec
199,114,762,646 instructions # 1.17 insn per cycle
1,338,722,757 branch-misses # 3.08% of all branches
```

Key metrics to watch:
- **CPUs utilized**: how many cores the server is actually using (3.97 of 4 = fully saturated)
- **Context switches/sec**: lower is better; custom scheduler typically achieves 20-80/sec
- **IPC (insn per cycle)**: higher is better; >1.0 is good, <0.5 suggests memory stalls
- **Branch misses**: >5% suggests unpredictable control flow

If CPUs utilized equals your allocated core count, the server is CPU-bound — add more cores.
If context switches are high (>10K/sec), the scheduler or OS is thrashing.

pidstat is still useful for spotting **mock server or load generator bottlenecks**:
check `pidstat-mock.log` and `pidstat-loadgen.log` to ensure they aren't saturated.

### NON_VIRTUAL_NETTY (default mode)

```bash
./run-benchmark.sh --threads 4 \
--server-cpuset "4-7" --mock-cpuset "2-3" --load-cpuset "0-1" \
--connections 10000 --mock-think-time 30
```

### VIRTUAL_NETTY mode

```bash
./run-benchmark.sh --mode VIRTUAL_NETTY --threads 4 --io nio \
--server-cpuset "4-7" --mock-cpuset "2-3" --load-cpuset "0-1" \
--connections 10000 --mock-think-time 30
```

### Mockless mode (skip HTTP call to mock, inline Jackson work)

```bash
./run-benchmark.sh --mode NETTY_SCHEDULER --threads 4 --mockless \
--server-cpuset "4-7" --load-cpuset "0-1" \
--connections 10000
```

### With async-profiler

```bash
./run-benchmark.sh --mode NETTY_SCHEDULER --threads 4 \
--server-cpuset "4-7" --mock-cpuset "2-3" --load-cpuset "0-1" \
--profiler --profiler-path /path/to/async-profiler \
--warmup 15s --total-duration 45s
```

### Rate-limited test with wrk2

```bash
./run-benchmark.sh --mode NETTY_SCHEDULER --threads 4 \
--server-cpuset "4-7" --mock-cpuset "2-3" --load-cpuset "0-1" \
--rate 120000 --connections 10000 --total-duration 60s --warmup 15s
```

### With JFR events

```bash
./run-benchmark.sh --mode NETTY_SCHEDULER --threads 4 \
--server-cpuset "4-7" --mock-cpuset "2-3" --load-cpuset "0-1" \
--jfr --jfr-events NettyRunIo,VirtualThreadTaskRuns
```

### Mixed: CLI flags + env vars

```bash
SERVER_JVM_ARGS="-XX:+PrintGCDetails" ./run-benchmark.sh --mode VIRTUAL_NETTY --threads 2
```

## Output
@@ -260,7 +311,7 @@ java -cp benchmark-runner/target/benchmark-runner.jar \
8080 1 # port, thinkTimeMs (threads defaults to available processors)
```

### Handoff Server (custom scheduler mode)

```bash
java \
@@ -275,11 +326,11 @@ java \
--port 8081 \
--mock-url http://localhost:8080/fruits \
--threads 2 \
--mode netty_scheduler \
--io epoll
```

### Handoff Server (default split topology)

```bash
java \
@@ -292,6 +343,5 @@ java \
--port 8081 \
--mock-url http://localhost:8080/fruits \
--threads 2 \
--io epoll
```