
<EXPERIMENTAL - DO NOT REVIEW> DYNAMIC_UNBOUND support for portable runtime: lazy KV cache allocation#18350

Draft
psiddh wants to merge 1 commit into main from dynamic_unbound_kv_cache

Conversation

Contributor

psiddh commented Mar 19, 2026

Enable DYNAMIC_UNBOUND tensors in the portable runtime, allowing KV cache buffers to be dynamically managed rather than statically memory-planned. This is the architectural foundation for pay-as-you-go memory allocation in ExecuTorch LLM inference.

Core changes:

  • DynamicAllocator interface with allocate/reallocate/free
  • PalDynamicAllocator default impl (PAL-backed, 2x growth policy)
  • TrackingDynamicAllocator for memory stats observability
  • MemoryManager gains 4th slot for DynamicAllocator (backward compatible)
  • TensorImpl gains dynamic_allocator_ and capacity_bytes_ fields
  • TensorImpl::internal_resize_contiguous handles DYNAMIC_UNBOUND resize
  • tensor_parser_portable.cpp: remove DYNAMIC_UNBOUND rejection, wire up allocator at load time for tensors with no memory-planned data
  • method.cpp: FreeCall frees dynamic memory; destructor cleans up all
  • Module API auto-creates PalDynamicAllocator (DYNAMIC_UNBOUND just works)
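A minimal Python model of the allocator design sketched in the bullets above may help; the real implementation is C++ in the ExecuTorch portable runtime, and the method names and signatures here are assumptions based on the bullet list, not the actual API:

```python
# Hypothetical model of the DynamicAllocator interface from this PR.
# The real implementation is C++; exact signatures are assumptions.

class DynamicAllocator:
    """Interface with allocate / reallocate / free, as in the bullet list."""
    def allocate(self, nbytes):
        raise NotImplementedError
    def reallocate(self, buf, new_nbytes):
        raise NotImplementedError
    def free(self, buf):
        raise NotImplementedError

class PalDynamicAllocator(DynamicAllocator):
    """Default impl with a 2x growth policy: when a resize exceeds current
    capacity, reserve max(new_nbytes, 2 * old_capacity) so repeated
    KV-cache growth costs amortized O(1) copies."""
    def allocate(self, nbytes):
        return bytearray(nbytes)
    def reallocate(self, buf, new_nbytes):
        capacity = len(buf)
        if new_nbytes <= capacity:
            return buf  # still fits in reserved capacity: no copy
        new_capacity = max(new_nbytes, 2 * capacity)
        grown = bytearray(new_capacity)
        grown[:capacity] = buf  # preserve existing contents
        return grown
    def free(self, buf):
        pass  # garbage-collected here; the C++ version releases via the PAL

alloc = PalDynamicAllocator()
kv = alloc.allocate(64)
kv = alloc.reallocate(kv, 80)  # 80 > 64, so capacity doubles to 128
print(len(kv))  # 128
```

In the C++ runtime, the reserved size would presumably be tracked by the new capacity_bytes_ field on TensorImpl rather than being queryable from the buffer itself.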

Export changes:

  • MarkDynamicUnboundPass marks KV cache buffers as DYNAMIC_UNBOUND
  • --lazy_kv_cache flag for Llama export
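To illustrate the pay-as-you-go behavior end to end, here is a hypothetical model (not ExecuTorch code) of a KV cache that starts empty and only reallocates when the growing sequence outruns its reserved capacity, mirroring the capacity_bytes_ tracking and 2x growth policy described above:

```python
# Illustrative model of lazy KV cache growth: nothing is reserved up
# front, so a large max_context_length costs no memory at load time,
# and geometric growth keeps the reallocation count logarithmic.

class LazyKvCache:
    def __init__(self, bytes_per_token):
        self.bytes_per_token = bytes_per_token
        self.size_bytes = 0        # bytes in use by the current sequence
        self.capacity_bytes = 0    # bytes reserved (capacity_bytes_ analogue)
        self.reallocations = 0     # how many times we actually grew

    def resize_for(self, seq_len):
        """Called once per decode step as the sequence grows."""
        needed = seq_len * self.bytes_per_token
        if needed > self.capacity_bytes:
            # 2x growth policy: reserve at least double the old capacity.
            self.capacity_bytes = max(needed, 2 * self.capacity_bytes)
            self.reallocations += 1
        self.size_bytes = needed

cache = LazyKvCache(bytes_per_token=256)
for step in range(1, 1025):        # decode 1024 tokens
    cache.resize_for(step)
print(cache.capacity_bytes, cache.reallocations)  # 262144 11
```

Decoding 1024 tokens touches only 11 reallocations, versus reserving the full max_context_length worth of memory at load time under static memory planning.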


Co-authored-by: Claude <noreply@anthropic.com>

pytorch-bot bot commented Mar 19, 2026

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/18350

Note: Links to docs will display an error until the doc builds have completed.

❌ 9 New Failures, 1 Unrelated Failure

As of commit f0b5b5f with merge base 02bad9d:

NEW FAILURES - The following jobs have failed:

FLAKY - The following job failed but was likely due to flakiness present on trunk:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

meta-cla bot added the CLA Signed label (managed by the Facebook bot; authors need to sign the CLA before a PR can be reviewed) on Mar 19, 2026
psiddh changed the title from "DYNAMIC_UNBOUND support for portable runtime: lazy KV cache allocation" to "<EXPERIMENTAL - DO NOT REVIEW> DYNAMIC_UNBOUND support for portable runtime: lazy KV cache allocation" on Mar 19, 2026
@github-actions

This PR needs a release notes: label

If your change should be included in the release notes (i.e. would users of this library care about this change?), please use a label starting with release notes:. This helps us keep track and include your important work in the next release notes.

To add a label, you can comment to pytorchbot, for example
@pytorchbot label "release notes: none"

For more information, see
https://github.com/pytorch/pytorch/wiki/PyTorch-AutoLabel-Bot#why-categorize-for-release-notes-and-how-does-it-work.

Review comment on the --lazy_kv_cache argument definition in the Llama export script:

    default=False,
    help="Mark KV cache buffers as DYNAMIC_UNBOUND so they are allocated "
    "lazily at runtime instead of at load time. Reduces initial memory "
    "usage when max_context_length is large.",
is this because we do actually touch the full memory during attention?

