<EXPERIMENTAL - DO NOT REVIEW> DYNAMIC_UNBOUND support for portable runtime: lazy KV cache allocation#18350
Enable DYNAMIC_UNBOUND tensors in the portable runtime, allowing KV cache buffers to be dynamically managed rather than statically memory-planned. This is the architectural foundation for pay-as-you-go memory allocation in ExecuTorch LLM inference.

Core changes:
- `DynamicAllocator` interface with allocate/reallocate/free
- `PalDynamicAllocator` default implementation (PAL-backed, 2x growth policy)
- `TrackingDynamicAllocator` for memory-stats observability
- `MemoryManager` gains a fourth slot for the `DynamicAllocator` (backward compatible)
- `TensorImpl` gains `dynamic_allocator_` and `capacity_bytes_` fields
- `TensorImpl::internal_resize_contiguous` handles DYNAMIC_UNBOUND resizes
- `tensor_parser_portable.cpp`: remove the DYNAMIC_UNBOUND rejection; wire up the allocator at load time for tensors with no memory-planned data
- `method.cpp`: `FreeCall` frees dynamic memory; the destructor cleans up everything remaining
- Module API auto-creates a `PalDynamicAllocator`, so DYNAMIC_UNBOUND just works

Export changes:
- `MarkDynamicUnboundPass` marks KV cache buffers as DYNAMIC_UNBOUND
- `--lazy_kv_cache` flag for the Llama export

Co-authored-by: Claude <noreply@anthropic.com>
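To make the allocator pieces concrete, here is a minimal sketch of what a `DynamicAllocator` interface with a PAL-backed 2x-growth default could look like. The names follow the summary above, but the signatures and the `std::malloc` stand-in for the PAL hook are assumptions for illustration, not the actual ExecuTorch API:

```cpp
#include <cstddef>
#include <cstdlib>
#include <cstring>

// Hypothetical sketch -- not the actual ExecuTorch interface.
class DynamicAllocator {
 public:
  virtual ~DynamicAllocator() = default;
  virtual void* allocate(size_t nbytes) = 0;
  virtual void* reallocate(void* ptr, size_t old_nbytes, size_t new_nbytes) = 0;
  virtual void free(void* ptr) = 0;
};

// PAL-backed default with a 2x growth policy: when a grow is requested,
// reserve twice the new size to amortize future reallocations.
class PalDynamicAllocator : public DynamicAllocator {
 public:
  void* allocate(size_t nbytes) override {
    return std::malloc(nbytes);  // stand-in for the PAL allocation hook
  }
  void* reallocate(void* ptr, size_t old_nbytes, size_t new_nbytes) override {
    size_t capacity = new_nbytes * 2;  // 2x growth policy
    void* grown = std::malloc(capacity);
    if (grown != nullptr && ptr != nullptr) {
      std::memcpy(grown, ptr, old_nbytes);  // preserve existing cache contents
      std::free(ptr);
    }
    return grown;
  }
  void free(void* ptr) override { std::free(ptr); }
};
```

A `TrackingDynamicAllocator` would presumably wrap an inner `DynamicAllocator` with the same interface and keep counters for live bytes and peak usage; the virtual interface is what lets the Module API swap one in without touching the tensor code.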
🔗 Helpful Links: 🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/18350
Note: Links to docs will display an error until the docs builds have been completed.
❌ 9 New Failures, 1 Unrelated Failure as of commit f0b5b5f with merge base 02bad9d.
FLAKY: the following job failed but was likely due to flakiness present on trunk.
This comment was automatically generated by Dr. CI and updates every 15 minutes.
```python
default=False,
help="Mark KV cache buffers as DYNAMIC_UNBOUND so they are allocated "
"lazily at runtime instead of at load time. Reduces initial memory "
"usage when max_context_length is large.",
```
Is this because we actually touch the full memory during attention?