Copyright 2025-2026 Timo Heimonen timo.heimonen@proton.me
License: GPL-3.0 license
Low-level tool to measure memory read/write/copy bandwidth, cache/main memory latency, and access pattern performance on macOS Apple Silicon (ARM64).

MacBook Air M5 cache latency, measured with the provided example script across different TLB locality window sizes and stride sizes.

MacBook Air M5 memory latency from the JSON output file.
A DRAM TLB-hit latency of ~9.8 ns is really nice!
More Apple Silicon M5 results are in the /results/0.53.8/ folder.
memory_benchmark measures memory behavior on macOS Apple Silicon with an implementation focused on practical low-level analysis:
- Main memory bandwidth (read/write/copy).
- Cache bandwidth (L1/L2 auto-detected, or custom cache target).
- Main memory latency (dependent pointer chase).
- Cache latency (dependent pointer chase).
- Pattern bandwidth behavior (sequential, strided, random).
The benchmark uses ARM64 assembly kernels, warmup passes, and loop statistics to produce stable and comparable results.
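To make the term "dependent pointer chase" concrete, here is a minimal Python sketch. It is not the tool's ARM64 kernel and will not measure hardware latency; it only illustrates the data structure. It assumes the chain is a random single-cycle permutation built with Sattolo's algorithm, which is one common construction and not necessarily the one used here. The names build_chain and walk_chain are hypothetical.

```python
import random

def build_chain(num_slots, seed=0):
    """Build next_index so that following it from slot 0 visits every slot
    exactly once before returning to slot 0 (a single cycle).
    Uses Sattolo's algorithm; this choice is an assumption for illustration."""
    rng = random.Random(seed)
    next_index = list(range(num_slots))
    for i in range(num_slots - 1, 0, -1):
        j = rng.randrange(i)  # 0 <= j < i, never i itself
        next_index[i], next_index[j] = next_index[j], next_index[i]
    return next_index

def walk_chain(next_index, steps, start=0):
    """Each step needs the result of the previous one, so the accesses
    form a serial dependency chain."""
    idx = start
    for _ in range(steps):
        idx = next_index[idx]
    return idx

chain = build_chain(1024)
# After exactly num_slots steps the walk is back at its starting slot.
print(walk_chain(chain, len(chain)) == 0)
```

Because every load address comes from the previous load, a chain like this limits how much the memory system can overlap accesses, which is why the pattern is used for latency rather than bandwidth measurements.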
This project is built to characterize Apple Silicon memory behavior in a direct and reproducible way.
Compared to generic benchmark suites, it is intentionally focused on:
- Native ARM64 execution on macOS.
- Cache-aware and latency-aware test paths.
- Explicit mode control for bandwidth-only, latency-only, and pattern-only workflows.
- JSON output that is easy to automate and plot.
Use this software at your own risk. It performs sustained, intensive memory operations. The author is not responsible for instability, data loss, or hardware issues resulting from use.
- macOS
- Apple Silicon (ARM64)
brew install timoheimonen/macOS-memory-benchmark/memory-benchmark
Prerequisites:
- Xcode Command Line Tools (xcode-select --install)
Build:
git clone https://github.com/timoheimonen/macOS-memory-benchmark.git
cd macOS-memory-benchmark
make
Run from source tree (without installing to PATH):
./memory_benchmark -h
Install test dependency:
brew install googletest
Run unit tests:
make test
Examples below use the Homebrew/PATH form (memory_benchmark ...).
If you are running directly from a local source build, use ./memory_benchmark ....
Help:
memory_benchmark -h
Default run:
memory_benchmark
For longer runs, prevent sleep:
caffeinate -i -d memory_benchmark -count 10 -buffersize 1024
- Default mode: Runs main bandwidth + main latency + cache bandwidth + cache latency.
- -patterns: Runs pattern bandwidth suite only (sequential_forward, sequential_reverse, strided_64, strided_4096, strided_16384, strided_2mb, random).
- -only-bandwidth: Runs bandwidth paths only (-patterns, -cache-size, and -latency-samples are not allowed in this mode).
- -only-latency: Runs latency paths only (-patterns and -iterations are not allowed in this mode).
- -analyze-tlb: Runs standalone TLB analysis mode; only optional -output <file>, -latency-stride-bytes <bytes>, and -latency-chain-mode <mode> may be combined with it.
- -analyze-core2core: Runs standalone core-to-core cache-line handoff analysis mode; only optional -output <file>, -count <count>, and -latency-samples <count> may be combined with it.
Latency-specific disable controls in -only-latency:
- -buffersize 0 disables main-memory latency target.
- -cache-size 0 disables cache-latency target.
- Both disabled at once is invalid.
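The mode restrictions above can be summarized as a small compatibility check. The following Python sketch encodes only the rules stated in this README; the tool's real argument parser may apply additional checks or report errors differently. The table names and find_conflicts are hypothetical.

```python
# Standalone modes: only the listed flags may accompany them.
STANDALONE_ALLOWED = {
    "-analyze-tlb": {"-output", "-latency-stride-bytes", "-latency-chain-mode"},
    "-analyze-core2core": {"-output", "-count", "-latency-samples"},
}
# Restricting modes: the listed flags are rejected.
MODE_FORBIDDEN = {
    "-only-bandwidth": {"-patterns", "-cache-size", "-latency-samples"},
    "-only-latency": {"-patterns", "-iterations"},
}

def find_conflicts(options):
    """options maps flag name -> value (None for bare switches).
    Returns a list of human-readable problems; empty means no conflict found."""
    flags = set(options)
    problems = []
    for mode, allowed in STANDALONE_ALLOWED.items():
        if mode in flags:
            for extra in sorted(flags - allowed - {mode}):
                problems.append(f"{extra} cannot be combined with {mode}")
    for mode, forbidden in MODE_FORBIDDEN.items():
        if mode in flags:
            for bad in sorted(flags & forbidden):
                problems.append(f"{bad} is not allowed with {mode}")
    # In -only-latency, disabling both latency targets leaves nothing to run.
    if ("-only-latency" in flags
            and options.get("-buffersize") == 0
            and options.get("-cache-size") == 0):
        problems.append("-buffersize 0 and -cache-size 0 disable both latency targets")
    return problems

print(find_conflicts({"-only-bandwidth": None, "-cache-size": 4096}))
```

A wrapper script could run a check like this before launching a long benchmark batch, so that an invalid combination fails immediately rather than partway through.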
- -buffersize <MB>: Main buffer size (default 512; auto-capped by memory safety rules).
- -iterations <count>: Bandwidth iterations per loop (default 1000).
- -count <count>: Full benchmark repetitions (default 1; use 5-10 for statistics).
- -threads <count>: Bandwidth thread count (latency tests remain single-threaded).
- -cache-size <KB>: Custom cache target. Non-zero range is 16 to 1048576 KB (1 GB).
- -analyze-tlb: Standalone TLB-boundary detection benchmark (1024/512/256 MB fallback buffer selection), sweeping locality windows from max(16 KB, 2*stride) to 256 MB (plus optional 512 MB page-walk comparison when buffer is at least 512 MB). Supports optional -latency-stride-bytes <bytes> and -latency-chain-mode <mode>.
- -analyze-core2core: Standalone two-thread cache-line ping-pong benchmark for coherence handoff latency, with three scheduler-hint scenarios (no_affinity_hint, same_affinity_tag, different_affinity_tags). Reports round-trip and one-way-estimate latency plus percentiles.
- -latency-samples <count>: Samples per latency test (default 1000).
- -latency-stride-bytes <bytes>: Pointer-chain stride for latency tests (default 64; must be > 0 and pointer-size aligned).
- -latency-chain-mode <mode>: Pointer-chain construction policy. Modes: auto (default), global-random, random-box, same-random-in-box, diff-random-in-box.
- -latency-tlb-locality-kb <KB>: Pointer-chain locality window (default 16; 0 = global random chain; non-zero values must be page-size multiples). If omitted, regular main-memory latency output also includes an automatic TLB comparison (16 KB hit-biased vs 0 miss-biased) and estimated page-walk penalty.
- -non-cacheable: Best-effort cache-discouraging hints (not true uncached memory).
- -output <file>: Save JSON output.
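The numeric constraints in the option list can be written out as simple predicates. This Python sketch restates the documented rules only; the function names are hypothetical, and the defaults of 8-byte pointers and 16 KB pages are assumptions for a 64-bit Apple Silicon system, passed as parameters so they are easy to change.

```python
KB = 1024

def valid_stride(stride_bytes, pointer_size=8):
    """-latency-stride-bytes: must be > 0 and pointer-size aligned."""
    return stride_bytes > 0 and stride_bytes % pointer_size == 0

def valid_cache_size_kb(cache_kb):
    """-cache-size: 0 (disable in -only-latency) or 16..1048576 KB."""
    return cache_kb == 0 or 16 <= cache_kb <= 1048576

def valid_locality_kb(locality_kb, page_size=16 * KB):
    """-latency-tlb-locality-kb: 0 (global random chain) or a page-size multiple."""
    return locality_kb == 0 or (locality_kb > 0 and (locality_kb * KB) % page_size == 0)

def tlb_sweep_start_bytes(stride_bytes):
    """-analyze-tlb: smallest locality window is max(16 KB, 2*stride)."""
    return max(16 * KB, 2 * stride_bytes)

print(valid_stride(64), valid_cache_size_kb(4096), valid_locality_kb(16))
print(tlb_sweep_start_bytes(128))
```

With the default 64-byte stride the sweep starts at the 16 KB floor; the 2*stride term only takes over once the stride exceeds 8 KB.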
Statistical baseline:
caffeinate -i -d memory_benchmark -count 10 -buffersize 1024 -output baseline.json
Pattern analysis:
memory_benchmark -patterns -count 10 -buffersize 512 -output patterns.json
Latency locality comparison:
memory_benchmark -only-latency -buffersize 1024 -count 10 -latency-samples 5000 -latency-tlb-locality-kb 16 -output lat_tlb16.json
memory_benchmark -only-latency -buffersize 1024 -count 10 -latency-samples 5000 -latency-tlb-locality-kb 0 -output lat_global.json
Regular benchmark with automatic DRAM TLB breakdown (omit -latency-tlb-locality-kb):
memory_benchmark -latency-stride-bytes 128 -count 1
TLB-vs-cache isolation (smaller stride within pages):
memory_benchmark -only-latency -buffersize 1024 -cache-size 4096 -latency-stride-bytes 64 -latency-tlb-locality-kb 16 -count 10 -latency-samples 5000 -output lat_stride64_tlb16.json
Custom cache target:
memory_benchmark -cache-size 4096 -threads 1 -count 5 -output cache_4mb.json
Standalone TLB analysis report:
memory_benchmark -analyze-tlb
Standalone TLB analysis with JSON export:
memory_benchmark -analyze-tlb -output tlb_analysis.json
Standalone TLB analysis with custom stride:
memory_benchmark -analyze-tlb -latency-stride-bytes 128 -output tlb_analysis_stride128.json
Standalone core-to-core handoff analysis:
memory_benchmark -analyze-core2core
Standalone core-to-core handoff analysis with JSON export and custom sample depth:
memory_benchmark -analyze-core2core -count 5 -latency-samples 2000 -output core2core.json
Console output includes:
- Resolved configuration and cache information.
- Per-loop benchmark results.
- Main-memory latency may include automatic TLB breakdown lines (TLB hit latency, TLB miss latency, and Estimated page-walk penalty) when -latency-tlb-locality-kb is not explicitly set.
- Aggregate statistics when -count > 1 (including P50/P90/P95/P99 and stddev). In auto-TLB mode, statistics also include TLB Hit Latency (ns), TLB Miss Latency (ns), and Estimated Page-Walk Penalty (ns).
JSON output shape:
- Standard mode:
{
  "configuration": {},
  "execution_time_sec": 0,
  "main_memory": {},
  "cache": {},
  "timestamp": "...",
  "version": "..."
}
- Pattern mode:
{
  "configuration": {},
  "execution_time_sec": 0,
  "patterns": {},
  "timestamp": "...",
  "version": "..."
}
Current latency payload is nested (not scalar):
"latency": {
  "average_ns": {
    "values": [],
    "statistics": {}
  },
  "samples_ns": {
    "values": [],
    "statistics": {}
  },
  "auto_tlb_breakdown": {
    "tlb_hit_ns": {
      "values": [],
      "statistics": {}
    },
    "tlb_miss_ns": {
      "values": [],
      "statistics": {}
    },
    "page_walk_penalty_ns": {
      "values": [],
      "statistics": {}
    }
  }
}
Run cache/locality sweep script:
./script-examples/latency_test_script.sh
Plot extracted percentile trends from script-examples/final_output.txt:
python3 script-examples/plot_cache_percentiles.py script-examples/final_output.txt --metric median
Supported metrics: median, p90, p95, p99, average, min, max, stddev.
Note: script-examples/latency_test_script.sh invokes memory_benchmark from PATH. If you only built locally as ./memory_benchmark, either install it or update BENCHMARK_CMD in the script.
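For ad-hoc analysis without the plotting script, the nested latency payload shown earlier can be consumed directly. The Python sketch below assumes you have already located the "latency" object inside the loaded JSON (its parent key is not specified here) and that each "values" array holds numeric nanosecond values. The contents of "statistics" are not documented in this README, so the sketch recomputes a median from "values" instead. The function name summarize_latency is hypothetical, and the sample numbers are placeholders, not measurements.

```python
import json
from statistics import median

def summarize_latency(latency):
    """Return the median of each documented 'values' array, or None if empty/absent."""
    def med(node):
        values = (node or {}).get("values") or []
        return median(values) if values else None

    summary = {"average_ns": med(latency.get("average_ns"))}
    # auto_tlb_breakdown is only expected when -latency-tlb-locality-kb was omitted.
    breakdown = latency.get("auto_tlb_breakdown")
    if breakdown:
        for key in ("tlb_hit_ns", "tlb_miss_ns", "page_walk_penalty_ns"):
            summary[key] = med(breakdown.get(key))
    return summary

# Placeholder payload with made-up numbers, shaped like the example above.
sample = json.loads("""
{
  "average_ns": {"values": [90, 100, 110], "statistics": {}},
  "samples_ns": {"values": [], "statistics": {}},
  "auto_tlb_breakdown": {
    "tlb_hit_ns": {"values": [10, 12, 11], "statistics": {}},
    "tlb_miss_ns": {"values": [40, 42, 41], "statistics": {}},
    "page_walk_penalty_ns": {"values": [30, 30, 30], "statistics": {}}
  }
}
""")
print(summarize_latency(sample))
```

Reading through .get() with fallbacks keeps the helper usable on files produced with an explicit -latency-tlb-locality-kb, where the breakdown may be absent.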
If other macOS programs are heavily active, treat results as contention-influenced.
Recommended interpretation approach:
- Keep background load profile consistent across compared runs.
- Use repeated loops (-count 10 or more).
- Prioritize median and tail percentiles (P95/P99) over single-run extremes.
- Keep command lines identical when comparing machines or builds.
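As a rough illustration of why the median and tail percentiles are more informative than single-run extremes, the Python sketch below summarizes a list of per-loop values. It uses the standard library's inclusive quantile method; the tool's own percentile calculation may differ, so small discrepancies against its reported statistics are expected. The function name robust_summary and the sample values are hypothetical.

```python
from statistics import median, quantiles

def robust_summary(values):
    """Summarize repeated-run results; needs at least two values."""
    cuts = quantiles(values, n=100, method="inclusive")  # 99 cut points: P1..P99
    return {
        "median": median(values),
        "p95": cuts[94],
        "p99": cuts[98],
        "min": min(values),
        "max": max(values),
    }

# Made-up latencies (ns) from ten loops, with one contention spike.
loops = [101.2, 100.8, 101.0, 100.9, 101.1, 100.7, 101.3, 100.9, 101.0, 140.5]
summary = robust_summary(loops)
print(summary["median"], summary["max"])
```

In this made-up series the single spike dominates the maximum while the median stays near the typical value, which is the behavior the recommendations above rely on.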
The measured bandwidth can be close to platform theoretical limits under favorable conditions. For example,
results/0.53.7/MacMiniM4_benchmark.json reports ~115.87 GB/s average main-memory read on Apple M4 versus
~120 GB/s theoretical peak (about 97%). Treat this as an empirical reference, not a guaranteed ceiling.
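The percentage quoted above is a plain ratio of measured to theoretical bandwidth. A short Python check, using only the two figures from this paragraph (the function name bandwidth_efficiency is hypothetical):

```python
def bandwidth_efficiency(measured_gb_s, theoretical_gb_s):
    """Fraction of the theoretical peak that was actually measured."""
    if theoretical_gb_s <= 0:
        raise ValueError("theoretical bandwidth must be positive")
    return measured_gb_s / theoretical_gb_s

efficiency = bandwidth_efficiency(115.87, 120.0)
print(f"{efficiency:.1%}")  # prints 96.6%
```

The exact ratio is about 96.6%, which the text rounds to 97%.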
Reference sample result files in this repository:
- results/0.53.7/MacMiniM4_benchmark.json
- results/0.53.7/MacMiniM4_patterns.json
- User Manual: full usage guide, option reference, workflows, troubleshooting.
- Technical Specification: architecture, execution flow, memory model, output contracts.
- Latency Whitepaper: dependent pointer-chase design, chain construction, and sampling methodology.
- TLB Analysis Whitepaper: standalone -analyze-tlb methodology, boundary/guard rules, confidence model, and JSON verification contract.
- -non-cacheable is best effort only (madvise hints); it does not create true uncached mappings.
- Small buffers can be cache-dominated, so they may not represent DRAM behavior.
- Apple Silicon user space has no explicit data-cache flush primitive equivalent to x86 CLFLUSH for strict cold-cache control.
- TLB-locality mode controls pointer-chain construction policy; it does not directly control hardware TLB residency.
- Background activity, thermals, and scheduling can materially affect tails and variance.