Multiprocess Parallel Random Data Generation for Benchmark Serving. #1038

Open

Duyi-Wang wants to merge 1 commit into SemiAnalysisAI:main from Duyi-Wang:mp_benchmark

Conversation

@Duyi-Wang

Summary

Accelerate random prompt generation in benchmark_serving.py by parallelizing the sample_random_requests() function with Python's multiprocessing.Pool. This addresses a bottleneck where generating large numbers of long prompts (e.g., 20K+ prompts at 8K+ input tokens) takes tens of minutes due to sequential tokenizer encode/decode operations.

Problem

When running benchmarks with high concurrency and long input sequences, the data preparation phase dominates total wall time. For example:

  • 2048 concurrency × 10 = 20,480 prompts @ 8,192 input tokens: the original serial path would take ~25 minutes just to generate prompt data before any actual benchmarking begins.

The root cause is that each prompt requires multiple tokenizer.decode()/tokenizer.encode() round-trips (up to 10 retries) to calibrate the token length, and this entire loop runs sequentially in a single process.
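A toy sketch of that calibration loop, using a whitespace tokenizer as a stand-in (the real benchmark uses a HuggingFace tokenizer; the function names here are illustrative assumptions, not the benchmark's actual code):

```python
import random

def toy_encode(text):        # stand-in for tokenizer.encode()
    return text.split()

def toy_decode(tokens):      # stand-in for tokenizer.decode()
    return " ".join(tokens)

def calibrate_prompt(token_ids, target_len, vocab, max_retries=10):
    """Decode/re-encode until the prompt round-trips to exactly target_len tokens."""
    prompt = toy_decode(token_ids)
    for _ in range(max_retries):
        re_encoded = toy_encode(prompt)
        if len(re_encoded) == target_len:
            break
        if len(re_encoded) < target_len:
            # pad with random tokens and retry
            pad = [random.choice(vocab) for _ in range(target_len - len(re_encoded))]
            prompt = toy_decode(re_encoded + pad)
        else:
            prompt = toy_decode(re_encoded[:target_len])
    return prompt

vocab = [f"tok{i}" for i in range(100)]
prompt = calibrate_prompt([random.choice(vocab) for _ in range(5)], 8, vocab)
assert len(toy_encode(prompt)) == 8
```

With a real tokenizer each decode/encode pair is expensive, so running this loop once per prompt, serially, for tens of thousands of long prompts dominates the wall time.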

Solution

  • Added multiprocessing support to sample_random_requests() via multiprocessing.Pool
  • Each worker process initializes its own tokenizer instance once (via Pool(initializer=...))
  • The prompt generation workload is split into chunks and distributed across workers
  • Added --random-num-workers CLI argument (grouped with other --random-* options):
    • 0 (default): auto-select min(cpu_count, 8) workers
    • 1: force serial execution (original behavior, full backward compatibility)
    • N: use exactly N worker processes
  • New parameters (tokenizer_id, tokenizer_mode, trust_remote_code, num_workers) added to sample_random_requests() function signature; all are optional with backward-compatible defaults
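The worker-pool structure described above can be sketched as follows. This is a minimal sketch, not the PR's exact code: `_init_worker`, `_generate_chunk`, and the placeholder tokenizer string are assumptions, and a real initializer would load the tokenizer (e.g., via `AutoTokenizer.from_pretrained`) once per process.

```python
import multiprocessing as mp

_worker_tokenizer = None  # one tokenizer instance per worker process

def _init_worker(tokenizer_id):
    """Runs once per worker via Pool(initializer=...)."""
    global _worker_tokenizer
    # Placeholder; the real code would construct an actual tokenizer here.
    _worker_tokenizer = f"tokenizer:{tokenizer_id}"

def _generate_chunk(chunk):
    """Generate one chunk of prompts using the worker-local tokenizer."""
    start, count, seed = chunk
    return [(start + i, seed) for i in range(count)]

def sample_parallel(num_prompts, num_workers, tokenizer_id="dummy/model"):
    # split the workload into contiguous chunks, one fixed seed per chunk
    chunk_size = (num_prompts + num_workers - 1) // num_workers
    chunks = [(i, min(chunk_size, num_prompts - i), 1000 + i)
              for i in range(0, num_prompts, chunk_size)]
    with mp.Pool(num_workers, initializer=_init_worker,
                 initargs=(tokenizer_id,)) as pool:
        results = pool.map(_generate_chunk, chunks)  # chunk order preserved
    return [item for chunk in results for item in chunk]

if __name__ == "__main__":
    out = sample_parallel(10, 3)
    assert [i for i, _ in out] == list(range(10))  # results stay in order
```

Initializing the tokenizer once per worker avoids paying the tokenizer construction cost on every task, and `pool.map()` keeps results in submission order regardless of which worker finishes first.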

Test Results

Tested with DeepSeek-R1 tokenizer (vocab_size=128,000), input_len=8192, output_len=1024, range_ratio=0.8:

Correctness Verification (2,048 prompts, serial vs parallel)

| Metric | Result |
| --- | --- |
| Prompt length exact match | 2048/2048 (100.0%) |
| Output length exact match | 2048/2048 (100.0%) |
| Prompt text exact match | 1949/2048 (95.2%) |
| Prompt length mean diff | 0.00 |
| Output lengths identical | True |
| Overall | PASS |

Note: ~5% prompt text difference is expected — the retry loop uses random token padding, and multiprocessing workers use independent RNG states. However, all prompt/output lengths match exactly, which is what matters for benchmark accuracy.

Performance (8 worker processes)

| Scenario | Serial | Parallel (8 workers) | Speedup |
| --- | --- | --- | --- |
| 2,048 prompts × 8K input | 150.88s | 24.74s | 6.10x |
| 20,480 prompts × 8K input | ~1,508s (est.) | 228.37s | ~6.6x |

Statistical Consistency

Serial   prompt_len: mean=7379.3  std=478.8  min=6553  max=8192
Parallel prompt_len: mean=7379.3  std=478.8  min=6553  max=8192

Serial   output_len: mean=920.6  std=60.2  min=819  max=1024
Parallel output_len: mean=920.6  std=60.2  min=819  max=1024

Files Changed

  • utils/bench_serving/benchmark_serving.py — Added multiprocessing support for prompt generation (+152/-28 lines)

Usage

# Default: auto-parallel with up to 8 workers (no change needed to existing scripts)
python benchmark_serving.py --dataset-name random --random-input-len 8192 --num-prompts 20480 ...

# Explicit worker count
python benchmark_serving.py --dataset-name random --random-num-workers 16 ...

# Force serial (original behavior)
python benchmark_serving.py --dataset-name random --random-num-workers 1 ...

Reproducibility Verification

The parallel path is fully deterministic: given the same --seed and --random-num-workers, multiple runs produce byte-identical results.

Verified by running 3 consecutive executions with seed=0, num_workers=4, num_prompts=200, input_len=1024 and computing MD5 over all prompt texts and lengths:

Run 1: hash=28004a3db05c9f6b98cc8169405f83c0, prompt_lens_sum=184378
Run 2: hash=28004a3db05c9f6b98cc8169405f83c0, prompt_lens_sum=184378
Run 3: hash=28004a3db05c9f6b98cc8169405f83c0, prompt_lens_sum=184378
All identical: True

This is guaranteed because:

  1. The np.random.seed() call in the main process makes input_lens, output_lens, offsets, and the per-worker seeds all deterministic
  2. Each worker creates an independent np.random.RandomState(seed) with its assigned fixed seed
  3. pool.map() returns results in chunk order (not completion order)
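Steps 1 and 2 can be illustrated with a small sketch (the exact seed-derivation scheme here is an assumption, not the PR's code):

```python
import numpy as np

def derive_worker_seeds(seed, num_workers):
    """Seed the main process, then draw one fixed seed per worker."""
    np.random.seed(seed)
    # input_lens/output_lens/offsets would be drawn here too, all deterministic
    return np.random.randint(0, 2**31 - 1, size=num_workers)

def worker_draw(worker_seed, n):
    """Each worker builds an independent RNG from its assigned seed."""
    rng = np.random.RandomState(worker_seed)
    return rng.randint(0, 100, size=n)

seeds_a = derive_worker_seeds(0, 4)
seeds_b = derive_worker_seeds(0, 4)
assert (seeds_a == seeds_b).all()  # same --seed -> same per-worker seeds
assert (worker_draw(seeds_a[0], 5) == worker_draw(seeds_b[0], 5)).all()
```

Because each worker's RandomState is constructed from a seed fixed in the main process, no worker's output depends on scheduling or on any other worker's RNG state.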

Note: changing --random-num-workers will change per-worker seed assignments, so results will differ from serial mode or a different worker count. However, the prompt/output length distributions remain statistically identical across any worker configuration.
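The hash check above can be reproduced with a sketch like this (the prompts here are stand-ins for the generated benchmark prompts):

```python
import hashlib

def digest(prompts):
    """MD5 over all prompt texts and lengths, plus the total length sum."""
    h = hashlib.md5()
    for p in prompts:
        h.update(p.encode("utf-8"))
        h.update(str(len(p)).encode("utf-8"))
    return h.hexdigest(), sum(len(p) for p in prompts)

run1 = digest(["alpha", "beta", "gamma"])
run2 = digest(["alpha", "beta", "gamma"])
assert run1 == run2  # identical generation -> identical hash and length sum
```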

Backward Compatibility

  • Default behavior changes from serial to parallel, but results are statistically equivalent
  • --random-num-workers 1 preserves exact original behavior
  • No changes to benchmark output format or metrics calculation
  • No new package dependencies (uses stdlib multiprocessing)


@claude (bot) left a comment


Claude Code Review

This pull request is from a fork — automated review is disabled. A repository maintainer can comment @claude review to run a one-time review.

