Skip to content

Fix metrics queue starvation on partial upload failures#636

Closed
svarlamov wants to merge 1 commit intomainfrom
codex/metrics-db-reliability-v2
Closed

Fix metrics queue starvation on partial upload failures#636
svarlamov wants to merge 1 commit intomainfrom
codex/metrics-db-reliability-v2

Conversation

@svarlamov
Copy link
Copy Markdown
Member

@svarlamov svarlamov commented Mar 4, 2026

Summary

  • return MetricsUploadResponse from upload_metrics_with_retry so callers can handle partial failures explicitly
  • fix flush-metrics-db to split successful vs rejected record IDs and delete both resolved sets, preventing poison rows from blocking the queue
  • harden metrics flush fallback behavior: only retry-buffer failed indices, keep full batch on invalid error indices, and retain log files when metrics cannot be uploaded or buffered
  • invoke flush-metrics-db from flush-logs so queued SQLite metrics are drained during regular flush runs
  • add targeted unit coverage for response-index mapping and failed-event extraction

Validation

  • cargo fmt
  • cargo test --lib test_split_record_ids_by_response -- --nocapture
  • cargo test --lib test_failed_metrics_events_for_retry -- --nocapture
  • cargo test

Open with Devin

@git-ai-cloud-dev
Copy link
Copy Markdown

git-ai-cloud-dev Bot commented Mar 4, 2026

Stats powered by Git AI

🧠 you    ████████████████████  100%
🤖 ai     ░░░░░░░░░░░░░░░░░░░░  0%
More stats
  • 0.0 lines generated for every 1 accepted
  • 0 seconds waiting for AI

AI code tracked with git-ai

Copy link
Copy Markdown
Contributor

@devin-ai-integration devin-ai-integration Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Devin Review found 1 potential issue.

View 4 additional findings in Devin Review.

Open in Devin Review

Comment on lines +163 to +168
user_log!(
" ! batch {} - uploaded {} events, {} event(s) failed and were kept for retry",
total_batches,
successful_count,
failed_count
);
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 Log message says failed events are "kept for retry" but they are immediately deleted

The user-facing log message at line 164 says "{} event(s) failed and were kept for retry", but the code at lines 178-191 deletes both successful_ids and failed_ids from the database. The comment at line 180-181 even explicitly confirms this: "Rejected events are validation failures and will not succeed on retry." The total_discarded counter correctly tracks these as discarded (line 154), and the final summary at line 222 correctly says "discarded {} rejected events". But the per-batch message tells the user the opposite of what actually happens — that those events were kept for retry — which is misleading for anyone debugging queue behavior.

Suggested change
user_log!(
" ! batch {} - uploaded {} events, {} event(s) failed and were kept for retry",
total_batches,
successful_count,
failed_count
);
user_log!(
" ! batch {} - uploaded {} events, {} event(s) failed and were discarded (validation errors)",
total_batches,
successful_count,
failed_count
);
Open in Devin Review

Was this helpful? React with 👍 or 👎 to provide feedback.

@git-ai-bot-svarlamov-dev
Copy link
Copy Markdown

Stats powered by Git AI

🧠 you    ████████████████████  100%
🤖 ai     ░░░░░░░░░░░░░░░░░░░░  0%
More stats
  • 0.0 lines generated for every 1 accepted
  • 0 seconds waiting for AI

AI code tracked with git-ai

@svarlamov svarlamov closed this Apr 17, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant