fix(config): handle corrupted config.toml in credential functions by deanq · Pull Request #481 · runpod/runpod-python

deanq · 2026-03-20T23:50:49Z

Prerequisite: #479

Summary

get_credentials() now catches (TypeError, ValueError) when toml.load() fails on a corrupted ~/.runpod/config.toml, returning None instead of crashing. This matches the existing pattern in check_credentials().
set_credentials() (non-overwrite path) gets the same treatment -- a corrupted file is treated as empty rather than crashing.

Root cause: get_credentials() is called at import time in runpod/__init__.py:81. A malformed config.toml causes a TOMLDecodeError (subclass of ValueError) that crashes any downstream package (e.g. flash) before their error handling can run.

Fixes: AE-2576
Related: AE-2485 (fixed Flash's own read path; this fixes the dependency's read path)

Test plan

test_get_credentials_corrupted_toml -- ValueError returns None
test_get_credentials_type_error -- TypeError returns None
test_set_credentials_corrupted_toml_allows_overwrite -- overwrite=True ignores corrupt file
test_set_credentials_corrupted_toml_no_overwrite -- overwrite=False treats corrupt as empty
Existing tests unchanged and passing

Replaces the existing CI-e2e.yml which depends on external mock-worker repo and opaque test-runner action. New design uses flash run dev server with purpose-built fixtures to validate SDK behaviors end-to-end.

15 tasks across 4 chunks: fixture project, QB tests, LB tests + CI workflows, and local validation. Reviewed and approved.

Smoke testing revealed several issues: - flash run QB routes dispatch remotely via @endpoint decorator, requiring RUNPOD_API_KEY for all tests (not just LB) - Request body format must match handler param names (input_data, not prompt) - Health check must catch ConnectTimeout in addition to ConnectError - lb_endpoint.py startScript had wrong handler path (/src/ vs /app/) - asyncio_runner.Job requires (endpoint_id, job_id, session, headers), replaced with asyncio_runner.Endpoint - Cold start test uses dedicated port (8199) to avoid conflicts - CI-e2e.yml now requires RUNPOD_API_KEY secret and has 15min timeout - HTTP client timeout increased to 120s for remote dispatch latency

- Remove unused import runpod from test_lb_dispatch.py - Narrow bare Exception catch to (TypeError, ValueError, RuntimeError) - Extract _wait_for_ready to conftest as wait_for_ready with poll_interval param - Replace assert with pytest.fail in verify_local_runpod fixture - Move _patch_runpod_base_url to conftest as autouse fixture (DRY) - Add named constants for ports and timeouts - Add status assertions on set calls in test_state_independent_keys

The previous install order (flash first, then editable with --no-deps) left aiohttp and other transitive deps missing because the editable build produced a different version identifier. Fix: install editable with full deps first, then flash, then re-overlay the editable.

The SDK's RunPodClient and AsyncEndpoint constructors check runpod.api_key at init time. The conftest patched endpoint_url_base but never set api_key, causing RuntimeError for all SDK client tests. Also add response body to state test assertions for debugging the 500 errors from stateful_handler remote dispatch.

The CI unit test workflow runs `uv run pytest` without path restrictions, which collects e2e tests that require flash CLI. Add tests/e2e to norecursedirs so only CI-e2e.yml runs these tests (with explicit markers).

Flash's @Remote dispatch provisions serverless endpoints on first request (~60s cold start). Without warmup, early tests fail with 500 because endpoints aren't ready. Run concurrent warmup requests in the flash_server fixture to provision all 3 QB endpoints before tests. Also add response body to assertion messages for better debugging.

- Remove test_async_run: flash dev server's /run endpoint doesn't return Runpod API format ({"id": "..."}) needed for async job polling - Remove test_run_async_poll: same /run incompatibility - Redesign state tests: remote dispatch means in-memory state can't persist across calls, so test individual set/get operations instead - Add explicit timeout=120 to SDK run_sync() calls to prevent 600s hangs - Reduce per-test timeout from 600s to 180s so hanging tests don't block the entire suite - Increase job timeout from 15 to 20 min to accommodate endpoint warmup

- Increase HTTP_CLIENT_TIMEOUT from 120 to 180s to match per-test timeout, preventing httpx.ReadTimeout for slow remote dispatch - Add AttributeError to expected exceptions in test_run_sync_error (SDK raises AttributeError when run_sync receives None input)

Drop 3.8 and 3.9 support, add 3.12. Flash requires 3.10+ and the SDK should target the same range.

…atch The stateful_handler uses multi-param kwargs (action, key, value) which flash's remote dispatch returns 500 for. The other handlers use a single dict param and work correctly. Remove the stateful handler fixture and tests — the remaining 7 tests provide solid coverage of handler execution, SDK client integration, cold start, and error propagation.

- Patch requests.Session.request instead of .get/.post in 401 tests (RunPodClient._request uses session.request, not get/post directly) - Fix test_missing_api_key to test Endpoint creation with None key (was calling run() on already-created endpoint with valid key) - Increase cold start benchmark threshold from 1000ms to 2000ms (CI runners with shared CPUs consistently exceed 1000ms)

Remote dispatch via flash dev server occasionally hangs after first successful request. Adding --reruns 1 --reruns-delay 5 to both e2e workflows as a mitigation for transient timeout failures.

The test_sync_handler and test_async_handler tests hit the flash dev server directly with httpx, which consistently times out due to remote dispatch hanging after warmup. These handlers are already validated by the SDK-level tests (test_endpoint_client::test_run_sync and test_async_endpoint::test_async_run_sync) which pass reliably.

Flash remote dispatch intermittently hangs on sequential requests to different handlers. Consolidated to use async_handler for the happy-path SDK test and removed the redundant test_async_endpoint.py. Only one handler gets warmed up now, reducing provisioning time and eliminating the cross-handler dispatch stall pattern.

…dpoint autouse=True forced flash_server startup before all tests including test_cold_start, which takes ~60s on its own server. By the time test_run_sync ran, the provisioned endpoint had gone cold, causing 120s timeout failures in CI. - Remove autouse=True, rename to patch_runpod_globals - Add patch_runpod_globals to test_endpoint_client usefixtures - Increase SDK timeout 120s → 180s to match pytest per-test timeout

Add --log-cli-level=INFO to pytest command so flash's existing log.info() calls for endpoint provisioning, job creation, and status polling are visible in CI logs.

Stop piping flash subprocess stderr so provisioning logs (endpoint IDs, GraphQL mutations, job status) flow directly to CI output.

Provisioned serverless endpoints were running the published PyPI runpod-python, not the PR branch. Use PodTemplate.startScript to pip install the PR's git ref before the original start command. - Add e2e_template.py: reads RUNPOD_SDK_GIT_REF, builds PodTemplate with startScript that installs PR branch then runs handler - Update fixture handlers to pass template=get_e2e_template() - Set RUNPOD_SDK_GIT_REF in both CI workflows - Align nightly workflow env var name, add --log-cli-level=INFO

Replace flash-run-based e2e tests with direct endpoint provisioning using Flash's Endpoint(image=...) mode. Tests now provision real Runpod serverless endpoints running the mock-worker image, inject the PR's runpod-python via PodTemplate(dockerArgs=...), and validate SDK behavior against live endpoints. - Add tests.json test case definitions (basic, delay, generator, async_generator) - Add e2e_provisioner.py: reads tests.json, groups by hardwareConfig, provisions one endpoint per unique config - Add test_mock_worker.py: parametrized tests driven by tests.json - Rewrite conftest.py: remove flash-run fixtures, add provisioning fixtures - Make test_cold_start.py self-contained with own fixture directory - Simplify CI workflows: remove flash run/undeploy steps - Set FLASH_IS_LIVE_PROVISIONING=false to use ServerlessEndpoint (LiveServerless overwrites imageName with Flash base image) - Delete old flash-run fixtures and test files

Log endpoint provisioning details (name, image, dockerArgs, gpus), job submission/completion (job_id, output, error), and SDK version so CI output shows what is happening during e2e runs.

Call resource_config.undeploy() for each provisioned endpoint in the session teardown to avoid accumulating orphaned endpoints and templates on the Runpod account across CI runs.

Copilot

Pull request overview

This PR hardens credential config handling so corrupted ~/.runpod/config.toml won’t crash callers (notably when get_credentials() is invoked at import time), aligning behavior with existing check_credentials() error handling.

Changes:

Catch (TypeError, ValueError) around toml.load() in get_credentials() and return None on parse/type failures.
Treat corrupted config as empty in set_credentials() when overwrite=False, preventing crashes during existence checks.
Add unit tests covering corrupted TOML and type errors for get_credentials(), plus corrupted TOML behavior for set_credentials().

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.

File	Description
`runpod/cli/groups/config/functions.py`	Adds defensive exception handling for corrupted TOML in credential read/check paths.
`tests/test_cli/test_cli_groups/test_config_functions.py`	Adds tests for corrupted TOML/type errors and `set_credentials()` overwrite/non-overwrite behavior.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

tests/test_cli/test_cli_groups/test_config_functions.py

Copilot · 2026-03-20T23:54:12Z

tests/test_cli/test_cli_groups/test_config_functions.py

+        """set_credentials with overwrite=True ignores corrupted existing file."""
+        mock_toml_load.side_effect = ValueError("Invalid TOML")
+        # overwrite=True skips the toml.load check entirely
+        functions.set_credentials("NEW_KEY", overwrite=True)
+
+    @patch("runpod.cli.groups.config.functions.toml.load")
+    @patch("builtins.open", new_callable=mock_open())
+    def test_set_credentials_corrupted_toml_no_overwrite(
+        self, _mock_file, mock_toml_load
+    ):
+        """set_credentials without overwrite treats corrupted file as empty."""
+        mock_toml_load.side_effect = ValueError("Invalid TOML")
+        # Should not raise — corrupted file is treated as having no profiles
+        functions.set_credentials("NEW_KEY", overwrite=False)


set_credentials() calls os.makedirs(...) and Path(...).touch(...) before opening the file. These new tests patch open, but they don't patch os.makedirs / Path.touch or override functions.CREDENTIAL_FILE, so the test can still create ~/.runpod/config.toml on the real filesystem when it runs. Patch those filesystem calls (or redirect CREDENTIAL_FILE to a temp path) to keep the tests hermetic.

runpod-Henrik

1. Core fix — correct

get_credentials() now returns None on (TypeError, ValueError) instead of crashing. This matches the existing pattern in check_credentials() and unblocks the import-time crash in runpod/__init__.py:81. The set_credentials() non-overwrite path gets the same treatment — a corrupted file is treated as empty rather than raising.

2. Bug: `test_set_credentials_corrupted_toml_allows_overwrite` doesn't test the changed code

mock_toml_load.side_effect = ValueError("Invalid TOML")
functions.set_credentials("NEW_KEY", overwrite=True)

With overwrite=True, the if not overwrite: block is skipped entirely — toml.load is never called. The side_effect is never triggered. The test would pass with or without this PR. The test that actually validates the changed code is test_set_credentials_corrupted_toml_no_overwrite. The overwrite=True test is redundant for this change and should either be removed or repurposed to verify write behavior.

3. Issue: silent failure gives no signal to the user

except (TypeError, ValueError):
    return None

get_credentials() returning None on a corrupted file is the right behavior for the import-time crash, but it's silent. Downstream code in runpod/__init__.py sets _credentials = None with no indication that the config was present but unreadable. A user gets auth failures with no explanation. A warnings.warn() or sys.stderr message noting the file is corrupted would make this debuggable.

Nit

check_credentials() already had the same (TypeError, ValueError) pattern — worth a comment in get_credentials() cross-referencing this so future readers understand the intentional symmetry.

Verdict: PASS WITH NITS

Fix is correct. Item 2 (misleading test) and item 3 (silent failure) are worth a follow-up.

🤖 Reviewed by Henrik's AI-Powered Bug Finder

- Redirect subprocess stdout/stderr to DEVNULL to prevent pipe deadlock - Guard send_signal with returncode check to avoid ProcessLookupError - Add explanatory comment to empty except clause (code scanning finding) - Remove machine-local absolute path from CLAUDE.md - Update spec status to reflect implementation is complete

- Submit all test jobs via asyncio.gather instead of serial parametrize - Exclude endpoint name from hardware_config_key so basic+delay share one endpoint (4 endpoints down to 3) - Reduce CI timeout from 20m to 15m

Previous teardown called resource_config.undeploy() via asyncio.get_event_loop().run_until_complete() which could fail silently due to event loop state during pytest session teardown. Switch to subprocess flash undeploy --all --force which runs in a clean process, reads .runpod/resources.pkl directly, and handles the full undeploy lifecycle reliably.

…credentials get_credentials() called toml.load() without exception handling, causing a raw TOMLDecodeError to crash any code that imports the runpod package when ~/.runpod/config.toml is malformed. Return None on parse failure, matching the existing pattern in check_credentials().

Address review feedback: - Patch os.path.exists via module path for consistency with existing tests - Patch os.makedirs and Path.touch in set_credentials tests for hermeticity

deanq added 30 commits March 13, 2026 16:26

docs: add flash-based e2e test design spec

27ae232

Replaces the existing CI-e2e.yml which depends on external mock-worker repo and opaque test-runner action. New design uses flash run dev server with purpose-built fixtures to validate SDK behaviors end-to-end.

docs: add flash-based e2e tests implementation plan

5df47c1

15 tasks across 4 chunks: fixture project, QB tests, LB tests + CI workflows, and local validation. Reviewed and approved.

chore: scaffold e2e test directory and fixture project

f5b9b76

feat: add QB fixture handlers for e2e tests

83cf285

feat: add LB fixture handler for e2e tests

ffc2556

chore: register e2e pytest markers (qb, lb, cold_start)

ba063bb

feat: add e2e conftest with flash_server lifecycle fixture

462e57e

feat: add e2e tests for sync and async QB handlers

ce36fb9

feat: add e2e tests for stateful worker persistence

8cfbc08

feat: add e2e tests for SDK Endpoint client round-trip

a739e01

feat: add e2e tests for async SDK Endpoint client

e449879

feat: add e2e cold start benchmark test

e20d41f

feat: add e2e tests for LB remote dispatch

aaaedf3

feat: replace CI-e2e.yml with flash-based QB e2e tests

ecec224

feat: add nightly CI workflow for full e2e suite including LB

62a804a

docs: update branch CLAUDE.md with implementation context

8b6c404

fix(ci): exclude tests/e2e from default pytest collection

9d152fe

The CI unit test workflow runs `uv run pytest` without path restrictions, which collects e2e tests that require flash CLI. Add tests/e2e to norecursedirs so only CI-e2e.yml runs these tests (with explicit markers).

fix(ci): update unit test matrix to Python 3.10-3.12

0e89ae9

Drop 3.8 and 3.9 support, add 3.12. Flash requires 3.10+ and the SDK should target the same range.

fix(ci): add pytest-rerunfailures for flaky remote dispatch timeouts

6b89edb

Remote dispatch via flash dev server occasionally hangs after first successful request. Adding --reruns 1 --reruns-delay 5 to both e2e workflows as a mitigation for transient timeout failures.

deanq added 8 commits March 13, 2026 21:38

fix(ci): surface flash provisioning logs in e2e test output

ce751fc

Add --log-cli-level=INFO to pytest command so flash's existing log.info() calls for endpoint provisioning, job creation, and status polling are visible in CI logs.

fix(e2e): surface flash server stderr to CI output

f2662e6

Stop piping flash subprocess stderr so provisioning logs (endpoint IDs, GraphQL mutations, job status) flow directly to CI output.

fix(e2e): add structured logging to provisioner and test execution

a8aa2e2

Log endpoint provisioning details (name, image, dockerArgs, gpus), job submission/completion (job_id, output, error), and SDK version so CI output shows what is happening during e2e runs.

feat(e2e): add endpoint cleanup after test session

1cb3489

Call resource_config.undeploy() for each provisioned endpoint in the session teardown to avoid accumulating orphaned endpoints and templates on the Runpod account across CI runs.

chore(ci): remove nightly e2e workflow

dd74099

deanq requested a review from Copilot March 20, 2026 23:51

Copilot started reviewing on behalf of deanq March 20, 2026 23:52 View session

Copilot AI reviewed Mar 20, 2026

View reviewed changes

runpod-Henrik reviewed Mar 21, 2026

View reviewed changes

deanq added 5 commits March 20, 2026 19:29

perf(e2e): run jobs concurrently and consolidate endpoints

98ce4af

- Submit all test jobs via asyncio.gather instead of serial parametrize - Exclude endpoint name from hardware_config_key so basic+delay share one endpoint (4 endpoints down to 3) - Reduce CI timeout from 20m to 15m

fix(test): use module-scoped patches and mock filesystem calls

89973d0

Address review feedback: - Patch os.path.exists via module path for consistency with existing tests - Patch os.makedirs and Path.touch in set_credentials tests for hermeticity

deanq force-pushed the deanq/ae-2576-bug-corrupted-credentials-file-crash branch from 35df61a to 89973d0 Compare March 21, 2026 04:40

deanq mentioned this pull request Mar 21, 2026

fix(cli): catch corrupted credentials file at import time runpod/flash#282

Open

4 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(config): handle corrupted config.toml in credential functions#481

fix(config): handle corrupted config.toml in credential functions#481
deanq wants to merge 43 commits intomainfrom
deanq/ae-2576-bug-corrupted-credentials-file-crash

deanq commented Mar 20, 2026 •

edited

Loading

Uh oh!

Copilot AI left a comment

Uh oh!

Uh oh!

Copilot AI Mar 20, 2026

Uh oh!

runpod-Henrik left a comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

deanq commented Mar 20, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Test plan

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Uh oh!

Copilot AI Mar 20, 2026

Choose a reason for hiding this comment

Uh oh!

runpod-Henrik left a comment

Choose a reason for hiding this comment

1. Core fix — correct

2. Bug: test_set_credentials_corrupted_toml_allows_overwrite doesn't test the changed code

3. Issue: silent failure gives no signal to the user

Nit

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

deanq commented Mar 20, 2026 •

edited

Loading

2. Bug: `test_set_credentials_corrupted_toml_allows_overwrite` doesn't test the changed code