Bump iOS XCTest timeout for ExecuTorchLLMTests #19354
psiddh wants to merge 2 commits into pytorch:main
@psiddh has exported this pull request. If you are a Meta employee, you can view the originating Diff in D104147313.
Pull request overview
This PR adjusts the Buck test configuration for the iOS LLM XCTest bundle to reduce spurious timeouts during long CPU-based simulator inference runs.
Changes:
- Switches the test label from `long_running` to `glacial` to increase the per-`XCTestCase` timeout tier.
- Sets a larger rule-level wall-clock timeout for the generated test bundle via `test_test_rule_timeout_ms`.
Summary:
The 13 XCTestCase methods in `xplat/executorch/extension/llm/apple:ExecuTorchLLMTests` (testLLaMA, testPhi4, testGemma, testLLaVA, testVoxtral and their reset variants) regularly hit the 1800-second per-test ceiling enforced by `fbobjc/Tools/xctest_runner` for the `long_running` label. LLM inference on iOS-sim CPU (1B-class models, 128-768 token sequences, each test calls `generate()` twice) routinely exceeds 30 minutes per test method, producing spurious "Test timed out after 1800 seconds" flakes on the test-issues dashboard for owner `ai_infra_mobile_platform`.

Per the runner formula `TEST_CASE_TIMEOUT(60s) * label_multiplier * 3`:

| label          | multiplier | per-XCTestCase budget |
|----------------|-----------:|----------------------:|
| long_running   | x10        | 1800s                 |
| glacial (here) | x30        | 5400s                 |

Switching to `glacial` (the highest tier supported by the runner) gives each test 90 minutes. Adding `test_test_rule_timeout_ms = 28800000` sets the bundle-level wall-clock budget to 8h, which is comfortable headroom for ~5 testcases at 90 min each plus xctest setup/teardown.

Note: this diff is unrelated to T269848646. T269848646 tracks a separate cluster of 446 iOS-sim test-run *cancellations* (`duration: 0.00`, "test execution was cancelled because the test run was cancelled") that is owned by testinfra and is not addressed here.

Differential Revision: D104147313
Summary:
The 13 XCTestCase methods in `xplat/executorch/extension/llm/apple:ExecuTorchLLMTests` (testLLaMA, testPhi4, testGemma, testLLaVA, testVoxtral and their reset variants) regularly hit the 1800-second per-test ceiling enforced by `fbobjc/Tools/xctest_runner` for the `long_running` label. LLM inference on iOS-sim CPU (1B-class models, 128-768 token sequences, each test calls `generate()` twice) routinely exceeds 30 minutes per test method, producing spurious "Test timed out after 1800 seconds" flakes on the test-issues dashboard for owner `ai_infra_mobile_platform`.

Per the runner formula `TEST_CASE_TIMEOUT(60s) * label_multiplier * 3`:

| label          | multiplier | per-XCTestCase budget |
|----------------|-----------:|----------------------:|
| long_running   | x10        | 1800s                 |
| glacial (here) | x30        | 5400s                 |

Switching to `glacial` (the highest tier supported by the runner) gives each test 90 minutes. Adding `test_test_rule_timeout_ms = 14400000` sets the bundle-level wall-clock budget to 4h, which is comfortable headroom for ~5 testcases at 90 min each plus xctest setup/teardown.

Note: this diff is unrelated to T269848646. T269848646 tracks a separate cluster of 446 iOS-sim test-run *cancellations* (`duration: 0.00`, "test execution was cancelled because the test run was cancelled") that is owned by testinfra and is not addressed here.

Reviewed By: shoumikhin

Differential Revision: D104147313
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
# Rule-level wall-clock for the whole auto-generated test bundle:
# ExecuTorchLLMTests currently contains 13 XCTestCase methods, and
# individual methods can exceed 30 minutes on iOS-sim CPU. This 4h
# budget is intended as the total bundle/shard wall-clock, including
Summary:
The 13 XCTestCase methods in `xplat/executorch/extension/llm/apple:ExecuTorchLLMTests` (testLLaMA, testPhi4, testGemma, testLLaVA, testVoxtral and their reset variants) regularly hit the 1800-second per-test ceiling enforced by `fbobjc/Tools/xctest_runner` for the `long_running` label. LLM inference on iOS-sim CPU (1B-class models, 128-768 token sequences, each test calls `generate()` twice) routinely exceeds 30 minutes per test method, producing spurious "Test timed out after 1800 seconds" flakes on the test-issues dashboard for owner `ai_infra_mobile_platform`.

Per the runner formula `TEST_CASE_TIMEOUT(60s) * label_multiplier * 3`:

| label          | multiplier | per-XCTestCase budget |
|----------------|-----------:|----------------------:|
| long_running   | x10        | 1800s                 |
| glacial (here) | x30        | 5400s                 |

Switching to `glacial` (the highest tier supported by the runner) gives each test 90 minutes. Adding `test_test_rule_timeout_ms = 14400000` sets the bundle-level wall-clock budget to 4h, which is comfortable headroom for ~5 testcases at 90 min each plus xctest setup/teardown.

Note: this diff is unrelated to T269848646. T269848646 tracks a separate cluster of 446 iOS-sim test-run *cancellations* (`duration: 0.00`, "test execution was cancelled because the test run was cancelled") that is owned by testinfra and is not addressed here.

Reviewed By: shoumikhin

Differential Revision: D104147313
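
For reference, the timeout arithmetic in the summary can be checked with a few lines of Python. This is only a sketch: the constants (60s base, x10 and x30 multipliers, the factor of 3, and the two `test_test_rule_timeout_ms` values) come from the description above, while the function and variable names are illustrative and are not taken from `xctest_runner`.

```python
# Constants quoted in the PR description; all names here are hypothetical.
TEST_CASE_TIMEOUT_S = 60
LABEL_MULTIPLIER = {"long_running": 10, "glacial": 30}


def per_test_budget_s(label: str) -> int:
    """Per-XCTestCase budget: TEST_CASE_TIMEOUT * label_multiplier * 3."""
    return TEST_CASE_TIMEOUT_S * LABEL_MULTIPLIER[label] * 3


def worst_case_tests_in_bundle(rule_timeout_ms: int, label: str) -> int:
    """How many tests fit in the rule-level budget if each one uses its
    full per-test budget (ignores xctest setup/teardown time)."""
    return (rule_timeout_ms // 1000) // per_test_budget_s(label)


if __name__ == "__main__":
    for label in LABEL_MULTIPLIER:
        print(f"{label}: {per_test_budget_s(label)}s per test")
    for rule_ms in (14_400_000, 28_800_000):
        hours = rule_ms / 3_600_000
        fits = worst_case_tests_in_bundle(rule_ms, "glacial")
        print(f"{hours:g}h rule budget: {fits} glacial tests at full budget")
```

Under these numbers, the 8h value covers five tests that each consume the full 90 minutes, while the 4h value covers two such tests, so the 4h budget relies on most test methods finishing well under their per-test ceiling.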