fix: handle HTTP 413 by splitting and retrying in OTLP HTTP exporters #5032

Open
Krishnachaitanyakc wants to merge 5 commits into open-telemetry:main from Krishnachaitanyakc:feat/retry-413-payload-splitting

@Krishnachaitanyakc
Contributor

Summary

When a backend returns HTTP 413 (Payload Too Large), the OTLP HTTP trace and log exporters now split the batch in half and recursively retry each half, preventing silent data loss when batch sizes exceed backend limits.

Fixes #4533

Changes

  • Added _is_payload_too_large() helper in _common/__init__.py
  • Refactored export() to delegate to _export_batch() in both trace and log exporters
  • _export_batch() handles 413 responses with binary splitting:
    • Base case: single-item batch returns FAILURE (item is genuinely too large)
    • Deadline guard: if deadline expired, returns FAILURE without recursing
    • Short-circuit: if first half fails, second half is not attempted
    • Recursive split: halves the batch and exports each half independently
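The splitting rules listed above can be sketched roughly as follows. This is a minimal illustration, not the PR's actual code: `ExportResult` here is a stand-in enum, `send` is a hypothetical callable that posts a batch and returns the HTTP status code, and the order of the base-case and deadline checks is an assumption. Retry handling for other status codes is left out.

```python
from enum import Enum
from time import monotonic
from typing import Callable, Sequence


class ExportResult(Enum):  # hypothetical stand-in for the SDK's result enum
    SUCCESS = 0
    FAILURE = 1


def is_payload_too_large(status_code: int) -> bool:
    """Return True for HTTP 413 (Payload Too Large)."""
    return status_code == 413


def export_batch(
    items: Sequence[object],
    send: Callable[[Sequence[object]], int],  # hypothetical: returns HTTP status
    deadline: float,  # a time.monotonic() timestamp
) -> ExportResult:
    """Export items, halving the batch whenever the backend answers 413."""
    status = send(items)
    if 200 <= status < 300:
        return ExportResult.SUCCESS
    if not is_payload_too_large(status):
        return ExportResult.FAILURE  # other errors are out of scope here
    if len(items) <= 1:
        return ExportResult.FAILURE  # base case: one item is too large on its own
    if monotonic() >= deadline:
        return ExportResult.FAILURE  # deadline guard: do not recurse further
    mid = len(items) // 2
    if export_batch(items[:mid], send, deadline) is not ExportResult.SUCCESS:
        return ExportResult.FAILURE  # short-circuit: skip the second half
    return export_batch(items[mid:], send, deadline)


# Usage: a fake backend that rejects batches larger than two items.
def fake_send(batch: Sequence[object]) -> int:
    return 200 if len(batch) <= 2 else 413


result = export_batch(list(range(8)), fake_send, monotonic() + 10)
```

With this fake backend, the eight-item batch is rejected, split into two batches of four, and each of those is split again into batches of two that are accepted, so `result` is `ExportResult.SUCCESS`.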

Notes

  • The metric exporter already has proactive batch splitting via max_export_batch_size and _split_metrics_data(). Reactive 413 handling for metrics is deferred to a follow-up since metric data has a nested protobuf structure that requires different splitting logic.
  • The gRPC exporter uses a different status code system (RESOURCE_EXHAUSTED) and would need separate handling in a future PR.

Test plan

  • 5 new span exporter tests: split success, single-item failure, recursive splitting, partial failure short-circuit, deadline expiry
  • 5 new log exporter tests: same scenarios
  • All existing tests pass (no regressions)
  • ruff linter passes
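Three of the scenarios in the test plan could be exercised along these lines. This is only a sketch: `split_export` is a hypothetical, compact stand-in for the exporter's split-and-retry path, and the mocked `send` callable returns bare status codes rather than real HTTP responses, so the PR's actual tests will look different.

```python
import unittest
from time import monotonic
from unittest.mock import Mock


def split_export(items, send, deadline):
    """Hypothetical compact stand-in for the exporter's split-and-retry path."""
    status = send(items)
    if status != 413:
        return status == 200
    if len(items) <= 1 or monotonic() >= deadline:
        return False
    mid = len(items) // 2
    # `and` short-circuits, so the second half is skipped if the first fails.
    return split_export(items[:mid], send, deadline) and split_export(
        items[mid:], send, deadline
    )


class SplitExportTest(unittest.TestCase):
    def test_first_half_failure_short_circuits(self):
        # Full batch gets 413, first half gets 500; second half is never sent.
        send = Mock(side_effect=[413, 500])
        self.assertFalse(split_export([1, 2, 3, 4], send, monotonic() + 10))
        self.assertEqual(send.call_count, 2)

    def test_expired_deadline_stops_recursion(self):
        send = Mock(return_value=413)
        self.assertFalse(split_export([1, 2], send, monotonic() - 1))
        self.assertEqual(send.call_count, 1)

    def test_single_item_too_large_fails(self):
        send = Mock(return_value=413)
        self.assertFalse(split_export([1], send, monotonic() + 10))
        send.assert_called_once()
```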

…rying

When a backend returns HTTP 413 (Payload Too Large), the trace and log
exporters now split the batch in half and recursively retry each half.
This prevents silent data loss when batch sizes exceed backend limits.

The splitting includes a deadline guard that stops recursing once the
export deadline has passed, short-circuits on first-half failure to avoid
wasting time on the second half, and fails individual items that are
genuinely too large.

Fixes open-telemetry#4533

- Add CHANGELOG.md entry for the 413 splitting feature
- Apply ruff format to source files (line wrapping adjustments)
- Rename loop variable 'i' to 'idx' to satisfy pylint naming convention

Relax assertAlmostEqual tolerance from 2 decimal places (0.005) to 1
(0.05) in timeout tests. The _export_batch refactoring adds a
serialization step between deadline calculation and the HTTP POST,
consuming a few extra milliseconds that exceed the tight tolerance on
slow runtimes like PyPy on Windows.
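The tolerance figures follow from how `assertAlmostEqual` works by default: it rounds the absolute difference to `places` decimal places and compares the result to zero. The sketch below mirrors that check in a hypothetical helper, `within_places`; the timing values are made up for illustration.

```python
def within_places(first: float, second: float, places: int) -> bool:
    """Mirror the default rounding check used by unittest's assertAlmostEqual."""
    return round(abs(first - second), places) == 0


expected_timeout = 1.0   # hypothetical configured timeout, in seconds
observed_elapsed = 1.02  # hypothetical measurement with 20 ms of overhead

# places=2 tolerates differences below roughly 0.005, so 20 ms is rejected.
strict = within_places(observed_elapsed, expected_timeout, places=2)

# places=1 tolerates differences below roughly 0.05, so 20 ms is accepted.
relaxed = within_places(observed_elapsed, expected_timeout, places=1)
```

Here `strict` is `False` and `relaxed` is `True`, which is the situation the relaxed tolerance is meant to cover.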
@Krishnachaitanyakc Krishnachaitanyakc marked this pull request as ready for review April 2, 2026 06:57
@Krishnachaitanyakc Krishnachaitanyakc requested a review from a team as a code owner April 2, 2026 06:57

Development

Successfully merging this pull request may close these issues.

Retry 413 / payload too large errors in OTLP batch exporter
