94 changes: 94 additions & 0 deletions BACKLOG.md
@@ -0,0 +1,94 @@
# Backlog

## 1. Kernel CPU accounting under-reports vs hardware PMU counters

**Priority**: Low (understanding only — use `perf stat` as ground truth)
**Status**: Root cause narrowed, further experiments possible

### Problem

The Linux kernel's CPU time accounting (`/proc/pid/stat`, `schedstat`) consistently under-reports
CPU utilization compared to `perf stat` (hardware PMU counters) for the Netty custom scheduler workload.

**Measured on NETTY_SCHEDULER with 4 carrier threads pinned to 4 physical cores (Ryzen 9 7950X):**

| Source | CPUs utilized | Notes |
|--------|--------------|-------|
| `perf stat` (PMU task-clock) | **3.96** | Hardware counter ground truth |
| `/proc/pid/stat` (utime+stime) | **3.19** | Kernel accounting |
| `schedstat` (sum of all thread run_ns) | **3.19** | CFS-level accounting |
| `pidstat` process-level | **2.84** | Even lower (pidstat's own sampling) |
| `pidstat` per-thread carrier sum | **2.72** | 4 x ~68% |
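
The kernel-side numbers in this table can be reproduced directly from `/proc/<pid>/stat`, the same counters pidstat reads. A minimal sketch (the class name, the common 100 Hz `CLK_TCK`, and the 2 s window are assumptions, not part of the harness):

```java
import java.nio.file.Files;
import java.nio.file.Path;

class ProcStatCpus {
    /** utime + stime for a pid, in clock ticks, parsed from /proc/<pid>/stat. */
    static long cpuTicks(long pid) throws Exception {
        String stat = Files.readString(Path.of("/proc/" + pid + "/stat"));
        // Field 2 (comm) may contain spaces, so split after the last ')'.
        String rest = stat.substring(stat.lastIndexOf(')') + 2);
        String[] f = rest.trim().split("\\s+");
        // utime and stime are fields 14 and 15 of the full line,
        // i.e. indices 11 and 12 once pid, comm and the ')' are stripped.
        return Long.parseLong(f[11]) + Long.parseLong(f[12]);
    }

    public static void main(String[] args) throws Exception {
        long pid = args.length > 0 ? Long.parseLong(args[0])
                                   : ProcessHandle.current().pid();
        long clkTck = 100;      // assumption: `getconf CLK_TCK` is 100 on most distros
        long windowMs = 2_000;  // sampling window
        long t0 = cpuTicks(pid);
        Thread.sleep(windowMs);
        long t1 = cpuTicks(pid);
        System.out.printf("%.2f CPUs utilized%n",
                (t1 - t0) / (double) clkTck / (windowMs / 1000.0));
    }
}
```

Comparing this figure against `perf stat` over the same window should expose the same gap described above.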

### Key findings

- **pidstat is NOT lying** — it faithfully reports what the kernel provides
- `/proc/pid/stat` and `schedstat` agree perfectly (both at 3.19 CPUs)
- The kernel itself under-counts by ~0.77 CPUs (19%) vs PMU hardware counters
- This is NOT caused by `CONFIG_TICK_CPU_ACCOUNTING` — the kernel uses
`CONFIG_VIRT_CPU_ACCOUNTING_GEN=y` (full dyntick, precise at context switch boundaries)
- All 31 kernel-visible threads were accounted for — no "hidden threads" from
`-Djdk.trackAllThreads=false` (VTs are invisible to the kernel by design, their CPU time
is charged to carrier threads)

### Hypotheses for the 0.77 CPU gap

1. **Kernel scheduling overhead**: `__schedule()` / `finish_task_switch()` runs during
thread transitions. Some of this CPU time may be attributed to idle/swapper rather
than the thread being switched in/out.

2. **Interrupt handling**: hardware interrupts (NIC, timer) steal cycles from the process.
   perf's `task-clock` keeps counting while an interrupt handler borrows a core from the
   running thread, whereas the kernel charges that time to the irq/softirq buckets of
   `/proc/stat` rather than to the thread.

3. **`task-clock` semantics**: `perf stat`'s `task-clock` sums, across all threads, the time
   each thread spends scheduled on a CPU. With 4 busy threads pinned to 4 cores, task-clock
   closely approaches 4.0 * elapsed, and it includes interrupt-handling time on those cores
   that `/proc/stat` charges elsewhere.

4. **Carrier thread park/unpark transitions**: even with VIRT_CPU_ACCOUNTING_GEN, the
accounting happens at `schedule()` boundaries. CPU cycles consumed during the entry/exit
paths of `LockSupport.park()` (before the actual `schedule()` call and after the wakeup)
may be partially lost.

### Further experiments (if desired)

1. **Compare `perf stat -e task-clock` vs `perf stat -e cpu-clock`**: `task-clock` counts
   per-thread scheduled time; `cpu-clock` counts wall-clock time. If they differ, interrupt
   overhead is the likely cause.

2. **Run with `nohz_full=4-7` (isolated CPUs)**: removes timer tick interrupts from server
cores. If the gap shrinks, interrupt overhead is the cause.

3. **Spin-wait instead of park**: replace `LockSupport.park()` with `Thread.onSpinWait()`
in `FifoEventLoopScheduler`. If gap shrinks, park/unpark accounting is lossy.

4. **Check `/proc/interrupts`** delta during benchmark: quantify how many interrupts hit
cores 4-7 and estimate their CPU cost.

5. **`perf stat` per-thread (`-t TID`)** for each carrier: compare PMU task-clock per
carrier vs schedstat per carrier to see if the gap is evenly distributed.
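
Experiment 4 needs no extra tooling: snapshot `/proc/interrupts`, run the benchmark, snapshot again, and diff. A hedged sketch (the class name and the 4-7 CPU list are illustrative):

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;

class IrqDelta {
    /** Sum interrupt counts for the given CPUs (CPU0 is the column after the IRQ label). */
    static long totalInterrupts(int... cpus) throws Exception {
        List<String> lines = Files.readAllLines(Path.of("/proc/interrupts"));
        long total = 0;
        for (String line : lines.subList(1, lines.size())) {  // skip the CPU header row
            String[] cols = line.trim().split("\\s+");
            for (int cpu : cpus) {
                int idx = cpu + 1;  // cols[0] is the "NN:" label
                // Guard: summary rows (ERR, MIS) and trailing descriptions are not per-CPU counts.
                if (idx < cols.length && cols[idx].matches("\\d+")) {
                    total += Long.parseLong(cols[idx]);
                }
            }
        }
        return total;
    }

    public static void main(String[] args) throws Exception {
        long before = totalInterrupts(4, 5, 6, 7);
        Thread.sleep(1_000);  // stand-in for the benchmark run
        long after = totalInterrupts(4, 5, 6, 7);
        System.out.println("interrupts delivered to CPUs 4-7: " + (after - before));
    }
}
```

Multiplying the delta by a rough per-interrupt cost (a few microseconds, hardware-dependent) gives an order-of-magnitude estimate to compare against the 0.77-CPU gap.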

### Conclusion

For benchmarking purposes, **always use `perf stat` as the ground truth** for CPU utilization.
pidstat is still useful for relative thread balance analysis and for monitoring non-server
components (mock server, load generator) where the gap is less significant.

---

## 2. Add spin-wait phase before carrier thread parking

**Priority**: Medium (performance optimization)
**Status**: Not started

In `FifoEventLoopScheduler.virtualThreadSchedulerLoop()`, the carrier thread parks immediately
when the queue drains. Adding a brief spin-wait phase (e.g., 100-1000 iterations of
`Thread.onSpinWait()`) before calling `LockSupport.park()` could:

- Reduce wake-up latency for incoming work (avoid kernel schedule/deschedule)
- Reduce context switch count (currently ~20/sec, could go to near-zero)
- Trade-off: slightly higher idle CPU consumption

### Key file
- `core/src/main/java/io/netty/loom/FifoEventLoopScheduler.java` line ~199
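
A spin-then-park idle loop along these lines could look as follows (the names, queue type, and iteration count are illustrative, not the scheduler's actual code):

```java
import java.util.Queue;
import java.util.concurrent.locks.LockSupport;

final class SpinThenPark {
    private static final int SPIN_ITERATIONS = 1_000;  // tuning knob from the range above

    /** Returns true if work arrived during the spin phase, i.e. no park was needed. */
    static boolean awaitWork(Queue<Runnable> queue) {
        for (int i = 0; i < SPIN_ITERATIONS; i++) {
            if (!queue.isEmpty()) {
                return true;          // work arrived while spinning: no context switch paid
            }
            Thread.onSpinWait();      // CPU pause hint; keeps the carrier on-core
        }
        LockSupport.park();           // queue stayed empty: fall back to kernel blocking
        return !queue.isEmpty();
    }
}
```

A production version would also need a bounded park (e.g. `parkNanos`) and an unpark on task submission; the sketch only shows the spin-phase trade-off.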
212 changes: 131 additions & 81 deletions benchmark-runner/README.md
@@ -49,46 +49,44 @@ The `run-benchmark.sh` script will automatically build the JAR if missing.

## Configuration

Configuration is via CLI flags (preferred) or environment variables (fallback); when both are set, the CLI flag wins.

Run `./run-benchmark.sh --help` for the full list.

### Server
| CLI flag | Env var | Default | Description |
|----------|---------|---------|-------------|
| `--mode` | `SERVER_MODE` | NON_VIRTUAL_NETTY | Mode: NON_VIRTUAL_NETTY, REACTIVE, VIRTUAL_NETTY, NETTY_SCHEDULER |
| `--threads` | `SERVER_THREADS` | 2 | Number of event loop threads |
| `--mockless` | `SERVER_MOCKLESS` | false | Skip mock server; do Jackson work inline |
| `--io` | `SERVER_IO` | epoll | I/O type: epoll, nio, io_uring |
| `--poller-mode` | `SERVER_POLLER_MODE` | | jdk.pollerMode: 1, 2, or 3 |
| `--fj-parallelism` | `SERVER_FJ_PARALLELISM` | | ForkJoinPool parallelism |
| `--server-cpuset` | `SERVER_CPUSET` | 2,3 | CPU pinning |
| `--jvm-args` | `SERVER_JVM_ARGS` | | Additional JVM arguments |

### Mock Server
| CLI flag | Env var | Default | Description |
|----------|---------|---------|-------------|
| `--mock-port` | `MOCK_PORT` | 8080 | Mock server port |
| `--mock-think-time` | `MOCK_THINK_TIME_MS` | 1 | Simulated processing delay (ms) |
| `--mock-threads` | `MOCK_THREADS` | 1 | Number of Netty threads |
| `--mock-cpuset` | `MOCK_CPUSET` | 4,5 | CPU pinning |

### Load Generator
| CLI flag | Env var | Default | Description |
|----------|---------|---------|-------------|
| `--connections` | `LOAD_GEN_CONNECTIONS` | 100 | Number of connections |
| `--load-threads` | `LOAD_GEN_THREADS` | 2 | Number of threads |
| `--duration` | `LOAD_GEN_DURATION` | 30s | Test duration |
| `--rate` | `LOAD_GEN_RATE` | | Target rate for wrk2 (omit for max throughput) |
| `--load-cpuset` | `LOAD_GEN_CPUSET` | 0,1 | CPU pinning |

### Timing
| CLI flag | Env var | Default | Description |
|----------|---------|---------|-------------|
| `--warmup` | `WARMUP_DURATION` | 10s | Warmup duration |
| `--total-duration` | `TOTAL_DURATION` | 30s | Total test duration (steady-state >= 20s) |

### Profiling
| Variable | Default | Description |
@@ -153,73 +151,126 @@ perf stat uses `PROFILING_DELAY_SECONDS` and `PROFILING_DURATION_SECONDS`.

## Example Runs

### Choosing CPU pinning with `lscpu -e`

Good benchmarking requires NUMA-aware CPU pinning. Start by inspecting your topology:

```bash
$ lscpu -e
CPU NODE SOCKET CORE L1d:L1i:L2:L3 ONLINE
0 0 0 0 0:0:0:0 yes # NUMA 0, physical core 0
1 0 0 1 1:1:1:0 yes # NUMA 0, physical core 1
...
8 1 0 8 8:8:8:1 yes # NUMA 1, physical core 8
...
16 0 0 0 0:0:0:0 yes # NUMA 0, SMT sibling of core 0
17 0 0 1 1:1:1:0 yes # NUMA 0, SMT sibling of core 1
```

Key rules:
- **Keep all benchmark components on the same NUMA node** to avoid cross-node memory latency
- **Use physical cores only** (avoid SMT siblings) for more stable results
- **Isolate noisy processes** (IDEs, browsers) on the other NUMA node

Example layout for a 16-core/2-NUMA system with 4 server threads:

| Component | CPUs | Rationale |
|-----------|------|-----------|
| Load generator | 0-1 | 2 physical cores, enough to saturate |
| Mock server | 2-3 | 2 physical cores for backend simulation |
| Handoff server | 4-7 | 4 physical cores, one per event loop thread |
| Other processes | 8-15 | Isolated on NUMA node 1 |

### NETTY_SCHEDULER with 4 threads

```bash
./run-benchmark.sh --mode NETTY_SCHEDULER --threads 4 --io nio \
--server-cpuset "4-7" --mock-cpuset "2-3" --load-cpuset "0-1" \
--jvm-args "-Xms8g -Xmx8g" \
--connections 10000 --load-threads 4 \
--mock-think-time 30 --mock-threads 4 \
--perf-stat
```

### Analyzing bottlenecks with perf stat

Use `--perf-stat` to get reliable hardware-level metrics. The `perf-stat.txt` output is the
ground truth for CPU utilization — pidstat per-thread numbers can be misleading with virtual threads.

```
Performance counter stats for process id '95868':
39,741,757,754 task-clock # 3.970 CPUs utilized
806 context-switches # 20.281 /sec
199,114,762,646 instructions # 1.17 insn per cycle
1,338,722,757 branch-misses # 3.08% of all branches
```

Key metrics to watch:
- **CPUs utilized**: how many cores the server is actually using (3.97 of 4 = fully saturated)
- **Context switches/sec**: lower is better; custom scheduler typically achieves 20-80/sec
- **IPC (insn per cycle)**: higher is better; >1.0 is good, <0.5 suggests memory stalls
- **Branch misses**: >5% suggests unpredictable control flow

If CPUs utilized equals your allocated core count, the server is CPU-bound — add more cores.
If context switches are high (>10K/sec), the scheduler or OS is thrashing.

pidstat is still useful for spotting **mock server or load generator bottlenecks**:
check `pidstat-mock.log` and `pidstat-loadgen.log` to ensure they aren't saturated.

### NON_VIRTUAL_NETTY (default mode)

```bash
./run-benchmark.sh --threads 4 \
--server-cpuset "4-7" --mock-cpuset "2-3" --load-cpuset "0-1" \
--connections 10000 --mock-think-time 30
```

### VIRTUAL_NETTY mode

```bash
./run-benchmark.sh --mode VIRTUAL_NETTY --threads 4 --io nio \
--server-cpuset "4-7" --mock-cpuset "2-3" --load-cpuset "0-1" \
--connections 10000 --mock-think-time 30
```

### Mockless mode (skip HTTP call to mock, inline Jackson work)

```bash
./run-benchmark.sh --mode NETTY_SCHEDULER --threads 4 --mockless \
--server-cpuset "4-7" --load-cpuset "0-1" \
--connections 10000
```

### With async-profiler

```bash
./run-benchmark.sh --mode NETTY_SCHEDULER --threads 4 \
--server-cpuset "4-7" --mock-cpuset "2-3" --load-cpuset "0-1" \
--profiler --profiler-path /path/to/async-profiler \
--warmup 15s --total-duration 45s
```

### Rate-limited test with wrk2

```bash
./run-benchmark.sh --mode NETTY_SCHEDULER --threads 4 \
--server-cpuset "4-7" --mock-cpuset "2-3" --load-cpuset "0-1" \
--rate 120000 --connections 10000 --total-duration 60s --warmup 15s
```

### With JFR events

```bash
./run-benchmark.sh --mode NETTY_SCHEDULER --threads 4 \
--server-cpuset "4-7" --mock-cpuset "2-3" --load-cpuset "0-1" \
--jfr --jfr-events NettyRunIo,VirtualThreadTaskRuns
```

### Mixed: CLI flags + env vars

```bash
SERVER_JVM_ARGS="-XX:+PrintGCDetails" ./run-benchmark.sh --mode VIRTUAL_NETTY --threads 2
```

## Output
@@ -260,7 +311,7 @@ java -cp benchmark-runner/target/benchmark-runner.jar \
8080 1 # port, thinkTimeMs (threads defaults to available processors)
```

### Handoff Server (custom scheduler mode)

```bash
java \
@@ -275,11 +326,11 @@ java \
--port 8081 \
--mock-url http://localhost:8080/fruits \
--threads 2 \
--mode netty_scheduler \
--io epoll
```

### Handoff Server (default split topology)

```bash
java \
@@ -292,6 +343,5 @@ java \
--port 8081 \
--mock-url http://localhost:8080/fruits \
--threads 2 \
--io epoll
```