[ET-VK][runtime] Add prepack cache to avoid duplicate weight prepacking #18361
Open
SS-JIA wants to merge 4 commits into gh/SS-JIA/499/base from
Conversation
When embedding and linear ops share tied weights and both use the same prepacking function (prepack_quantized_linear_weight), the weight gets prepacked twice, wasting GPU memory. Add a cache to ComputeGraph keyed by (input ValueRef, kernel name) that returns the already-prepacked tensor on cache hit, avoiding the duplicate allocation.

Differential Revision: [D97430801](https://our.internmc.facebook.com/intern/diff/D97430801/)

[ghstack-poisoned]
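For readers skimming the thread, here is a minimal sketch of the idea: a map on ComputeGraph keyed by the input ValueRef plus the prepack kernel name, consulted before a weight is packed again. The method names (`get_cached_prepacked`, `cache_prepacked`), the string key, and treating ValueRef as a small integer handle are illustrative assumptions, not the actual ExecuTorch Vulkan API.

```cpp
// Hedged sketch of a prepack cache keyed by (input ValueRef, kernel name).
// All names below are hypothetical; ValueRef is assumed to be an int handle.
#include <cstdint>
#include <string>
#include <unordered_map>

using ValueRef = int32_t;

class ComputeGraph {
 public:
  static constexpr ValueRef kInvalid = -1;

  // Returns the prepacked tensor ref recorded for (weight, kernel_name),
  // or kInvalid if this weight has not been packed by this kernel yet.
  ValueRef get_cached_prepacked(ValueRef weight, const std::string& kernel_name) const {
    auto it = prepack_cache_.find(make_key(weight, kernel_name));
    return it == prepack_cache_.end() ? kInvalid : it->second;
  }

  // Records the prepacked tensor produced for (weight, kernel_name).
  void cache_prepacked(ValueRef weight, const std::string& kernel_name, ValueRef packed) {
    prepack_cache_.emplace(make_key(weight, kernel_name), packed);
  }

 private:
  // The key includes the kernel name so the same weight packed by two
  // different prepack kernels (different layouts) still gets distinct entries.
  static std::string make_key(ValueRef weight, const std::string& kernel_name) {
    return std::to_string(weight) + "|" + kernel_name;
  }

  std::unordered_map<std::string, ValueRef> prepack_cache_;
};
```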
🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/18361
Note: Links to docs will display an error until the docs builds have been completed.
⏳ No Failures, 28 Pending as of commit ccda364 with merge base 38b40bc.
This comment was automatically generated by Dr. CI and updates every 15 minutes.
SS-JIA pushed a commit that referenced this pull request on Mar 20, 2026
When embedding and linear ops share tied weights and both use the same prepacking function (prepack_quantized_linear_weight), the weight gets prepacked twice, wasting GPU memory. Add a cache to ComputeGraph keyed by (input ValueRef, kernel name) that returns the already-prepacked tensor on cache hit, avoiding the duplicate allocation.

Differential Revision: [D97430801](https://our.internmc.facebook.com/intern/diff/D97430801/)
ghstack-source-id: 355089157
Pull Request resolved: #18361
…ht prepacking"

When embedding and linear ops share tied weights and both use the same prepacking function (prepack_quantized_linear_weight), the weight gets prepacked twice, wasting GPU memory. Add a cache to ComputeGraph keyed by (input ValueRef, kernel name) that returns the already-prepacked tensor on cache hit, avoiding the duplicate allocation.

Differential Revision: [D97430801](https://our.internmc.facebook.com/intern/diff/D97430801/)
[ghstack-poisoned]
SS-JIA pushed a commit that referenced this pull request on Mar 20, 2026
Pull Request resolved: #18361

When embedding and linear ops share tied weights and both use the same prepacking function (prepack_quantized_linear_weight), the weight gets prepacked twice, wasting GPU memory. Add a cache to ComputeGraph keyed by (input ValueRef, kernel name) that returns the already-prepacked tensor on cache hit, avoiding the duplicate allocation.

ghstack-source-id: 355234968
@exported-using-ghexport
Differential Revision: [D97430801](https://our.internmc.facebook.com/intern/diff/D97430801/)
…ht prepacking"

When embedding and linear ops share tied weights and both use the same prepacking function (prepack_quantized_linear_weight), the weight gets prepacked twice, wasting GPU memory. Add a cache to ComputeGraph keyed by (input ValueRef, kernel name) that returns the already-prepacked tensor on cache hit, avoiding the duplicate allocation.

Differential Revision: [D97430801](https://our.internmc.facebook.com/intern/diff/D97430801/)
[ghstack-poisoned]
SS-JIA pushed a commit that referenced this pull request on Mar 20, 2026
Pull Request resolved: #18361

When embedding and linear ops share tied weights and both use the same prepacking function (prepack_quantized_linear_weight), the weight gets prepacked twice, wasting GPU memory. Add a cache to ComputeGraph keyed by (input ValueRef, kernel name) that returns the already-prepacked tensor on cache hit, avoiding the duplicate allocation.

ghstack-source-id: 355269010
@exported-using-ghexport
Differential Revision: [D97430801](https://our.internmc.facebook.com/intern/diff/D97430801/)
…ht prepacking"

When embedding and linear ops share tied weights and both use the same prepacking function (prepack_quantized_linear_weight), the weight gets prepacked twice, wasting GPU memory. Add a cache to ComputeGraph keyed by (input ValueRef, kernel name) that returns the already-prepacked tensor on cache hit, avoiding the duplicate allocation.

Differential Revision: [D97430801](https://our.internmc.facebook.com/intern/diff/D97430801/)
[ghstack-poisoned]
SS-JIA pushed a commit that referenced this pull request on Mar 20, 2026
Pull Request resolved: #18361

When embedding and linear ops share tied weights and both use the same prepacking function (prepack_quantized_linear_weight), the weight gets prepacked twice, wasting GPU memory. Add a cache to ComputeGraph keyed by (input ValueRef, kernel name) that returns the already-prepacked tensor on cache hit, avoiding the duplicate allocation.

ghstack-source-id: 355353466
@exported-using-ghexport
Differential Revision: [D97430801](https://our.internmc.facebook.com/intern/diff/D97430801/)
Stack from ghstack (oldest at bottom):
When embedding and linear ops share tied weights and both use the same
prepacking function (prepack_quantized_linear_weight), the weight gets
prepacked twice, wasting GPU memory. Add a cache to ComputeGraph keyed
by (input ValueRef, kernel name) that returns the already-prepacked
tensor on cache hit, avoiding the duplicate allocation.
Differential Revision: D97430801
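To illustrate how the cache changes the prepack path, here is a hedged sketch of the lookup-or-pack pattern at a call site. It reuses the hypothetical ComputeGraph cache methods sketched earlier in this thread; `do_prepack` is a stand-in for the real packing logic, not an ExecuTorch symbol.

```cpp
// Hypothetical call site built on the ComputeGraph cache sketch above.
// The helper below is a placeholder, not a real ET-VK API.

// Stand-in for allocating the packed tensor and recording the dispatch.
static ValueRef do_prepack(ComputeGraph& graph, ValueRef weight,
                           const std::string& kernel_name) {
  (void)graph;
  (void)kernel_name;
  return weight + 1000;  // placeholder handle for the packed tensor
}

ValueRef prepack_quantized_linear_weight(ComputeGraph& graph, ValueRef weight) {
  const std::string kernel_name = "prepack_quantized_linear_weight";

  // Cache hit: when embedding and linear share a tied weight, the second
  // caller reuses the tensor that was already packed for the first.
  ValueRef cached = graph.get_cached_prepacked(weight, kernel_name);
  if (cached != ComputeGraph::kInvalid) {
    return cached;
  }

  // Cache miss: pack the weight once and remember the result under
  // (input ValueRef, kernel name) so later callers skip the allocation.
  ValueRef packed = do_prepack(graph, weight, kernel_name);
  graph.cache_prepacked(weight, kernel_name, packed);
  return packed;
}
```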