
[ET-VK][conv1d] Implement height-packed pointwise conv1d operator #18332

Open

SS-JIA wants to merge 3 commits into gh/SS-JIA/494/base from gh/SS-JIA/494/head

Conversation

@SS-JIA
Contributor

@SS-JIA SS-JIA commented Mar 19, 2026

Stack from ghstack (oldest at bottom):

Implement a new conv1d pointwise (kernel_size=1) operator using height-packed
layout where channels are the packed dimension (WHCN dim 1). This enables
dot-product reduction over input channels: each vec4 load gives 4 consecutive
channel values, yielding 4 MACs per dot() instruction.
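As a reference point (not part of this PR), the math the shader exploits is that a
kernel_size=1 conv1d is just a per-position matrix multiply over channels; the sketch
below checks that equivalence in PyTorch with the first benchmarked shape:

```python
import torch

# A kernel_size=1 conv1d reduces over input channels at every position:
# out[n, oc, l] = sum_ic w[oc, ic, 0] * x[n, ic, l] + bias[oc]
N, IC, OC, L = 1, 256, 512, 1024
x = torch.randn(N, IC, L)
w = torch.randn(OC, IC, 1)
b = torch.randn(OC)

ref = torch.nn.functional.conv1d(x, w, b)
# Same result as a plain matmul over channels: [OC, IC] x [N, IC, L] -> [N, OC, L].
# In the shader, channels are the packed dim, so each vec4 texel holds 4 consecutive
# IC values and one dot() performs 4 of these MACs.
mm = torch.einsum("oi,nil->nol", w[:, :, 0], x) + b[:, None]
assert torch.allclose(ref, mm, atol=1e-4)
```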

Uses tiled computation with the FP tile infrastructure from linear/matmul
(FPInputTile, FPWeightTile, FPOutTile, fp_accumulate_with_fp_weight) and
4OC×4IC blocked weight packing via pack_fp_linear_weight.glsl for
cache-friendly texture2d weight reads. tile_m is selected adaptively
(4/2/1 rows) based on GPU occupancy.
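For intuition, a minimal NumPy sketch of 4OC×4IC blocked packing is below; the exact
texel ordering emitted by pack_fp_linear_weight.glsl may differ, this only illustrates
the idea of grouping weights into 4 output channel × 4 input channel blocks:

```python
import numpy as np

# Illustrative 4OC x 4IC blocked packing (the real GLSL packer's texel order may differ).
OC, IC = 8, 8  # assumed multiples of 4 for simplicity
w = np.arange(OC * IC, dtype=np.float32).reshape(OC, IC)  # [OC, IC]

# One 4x4 block holds the weights needed to update a vec4 of outputs from a vec4 of
# inputs, i.e. four dot() instructions' worth of data read contiguously.
blocked = (
    w.reshape(OC // 4, 4, IC // 4, 4)   # [OC4, 4, IC4, 4]
     .transpose(0, 2, 1, 3)             # [OC4, IC4, 4, 4]
)
print(blocked[1, 0])  # block for output channels 4..7, input channels 0..3
```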

Thread mapping: X=OC4 tiles, Y=L tiles, Z=batch. Each thread computes
TILE_M×TILE_N4×4 output elements. Inner loop loads input tiles and packed
weight tiles, then calls fp_accumulate_with_fp_weight for tiled FMA.
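A scalar Python emulation of that dispatch is sketched below; it is illustrative only
(the real shader works on vec4 texels through the FP tile structs, and the tile sizes
here are just example values), but it shows how the X/Y/Z indices and the inner
reduction map to output elements:

```python
# Scalar emulation of the dispatch described above (illustrative only).
def conv1d_pw_emulated(x, w, b, tile_m=4, tile_n4=1):
    """x: [N][IC][L], w: [OC][IC], b: [OC], all nested Python lists."""
    N, IC, L = len(x), len(x[0]), len(x[0][0])
    OC = len(w)
    out = [[[0.0] * L for _ in range(OC)] for _ in range(N)]
    for z in range(N):                                  # Z: batch
        for y in range(0, L, tile_m):                   # Y: L tiles
            for x_tile in range(0, OC, 4 * tile_n4):    # X: OC4 tiles
                # One "thread": a tile_m x (tile_n4*4) output tile.
                for m in range(min(tile_m, L - y)):
                    for oc in range(x_tile, min(x_tile + 4 * tile_n4, OC)):
                        acc = b[oc]
                        # Reduction over input channels; the shader does this
                        # four channels at a time with vec4 dot() products.
                        for ic in range(IC):
                            acc += w[oc][ic] * x[z][ic][y + m]
                        out[z][oc][y + m] = acc
    return out
```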

Supports both buffer and texture3d storage for input/output, texture2d or
buffer for packed weights, fp32/fp16, and optional bias. Registered as
et_vk.conv1d_pw.default (standalone custom op for testing/benchmarking).

Performance on Adreno 750 (S24):

- [1,256,1024]x[512,256,1] texture f16: 908 GFLOP/s
- [1,512,2048]x[256,512,1] texture f16: 865 GFLOP/s
- [1,128,4096]x[128,128,1] texture f16: 781 GFLOP/s
- [1,256,1024]x[512,256,1] buffer f16: 491 GFLOP/s
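For context on the figures above, the back-of-envelope below counts each MAC as 2 FLOPs
(my assumption about the convention used) and derives an approximate per-call latency
for the first shape; the latency is computed from the reported throughput, not measured:

```python
# Back-of-envelope for [1,256,1024] x [512,256,1] at the reported texture f16 rate.
N, IC, L, OC = 1, 256, 1024, 512
flops = 2 * N * L * IC * OC          # 268,435,456 FLOPs ~= 0.27 GFLOP (2 FLOPs per MAC)
gflops_per_s = 908                   # reported throughput, texture f16
latency_ms = flops / (gflops_per_s * 1e9) * 1e3
print(f"{flops / 1e9:.3f} GFLOP -> ~{latency_ms:.3f} ms per call at {gflops_per_s} GFLOP/s")
```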

Differential Revision: [D97344092](https://our.internmc.facebook.com/intern/diff/D97344092/)

@pytorch-bot

pytorch-bot bot commented Mar 19, 2026

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/18332

Note: Links to docs will display an error until the docs builds have been completed.

✅ You can merge normally! (2 Unrelated Failures)

As of commit b6f0029 with merge base 9076110:

FLAKY - The following jobs failed but were likely due to flakiness present on trunk:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@github-actions

This PR needs a release notes: label

If your change should be included in the release notes (i.e. would users of this library care about this change?), please use a label starting with release notes:. This helps us keep track and include your important work in the next release notes.

To add a label, you can comment to pytorchbot, for example
@pytorchbot label "release notes: none"

For more information, see
https://github.com/pytorch/pytorch/wiki/PyTorch-AutoLabel-Bot#why-categorize-for-release-notes-and-how-does-it-work.

ssjia added 2 commits March 19, 2026 15:48
…perator"

Implement a new conv1d pointwise (kernel_size=1) operator using height-packed
layout where channels are the packed dimension (WHCN dim 1). This enables
dot-product reduction over input channels: each vec4 load gives 4 consecutive
channel values, yielding 4 MACs per dot() instruction.

Uses tiled computation with the FP tile infrastructure from linear/matmul
(FPInputTile, FPWeightTile, FPOutTile, fp_accumulate_with_fp_weight) and
4OC×4IC blocked weight packing via pack_fp_linear_weight.glsl for
cache-friendly texture2d weight reads. Adaptive tile_m selection (4/2/1 rows)
based on GPU occupancy.

Thread mapping: X=OC4 tiles, Y=L tiles, Z=batch. Each thread computes
TILE_M×TILE_N4×4 output elements. Inner loop loads input tiles and packed
weight tiles, then calls fp_accumulate_with_fp_weight for tiled FMA.

Supports both buffer and texture3d storage for input/output, texture2d or
buffer for packed weights, fp32/fp16, and optional bias. Registered as
et_vk.conv1d_pw.default (standalone custom op for testing/benchmarking).

Performance on Adreno 750 (S24):
- [1,256,1024]x[512,256,1] texture f16: 908 GFLOP/s
- [1,512,2048]x[256,512,1] texture f16: 865 GFLOP/s
- [1,128,4096]x[128,128,1] texture f16: 781 GFLOP/s
- [1,256,1024]x[512,256,1] buffer f16: 491 GFLOP/s

Differential Revision: [D97344092](https://our.internmc.facebook.com/intern/diff/D97344092/)

[ghstack-poisoned]
…perator"

Implement a new conv1d pointwise (kernel_size=1) operator using height-packed
layout where channels are the packed dimension (WHCN dim 1). This enables
dot-product reduction over input channels: each vec4 load gives 4 consecutive
channel values, yielding 4 MACs per dot() instruction.

Uses tiled computation with the FP tile infrastructure from linear/matmul
(FPInputTile, FPWeightTile, FPOutTile, fp_accumulate_with_fp_weight) and
4OC×4IC blocked weight packing via pack_fp_linear_weight.glsl for
cache-friendly texture2d weight reads. Adaptive tile_m selection (4/2/1 rows)
based on GPU occupancy.

Thread mapping: X=OC4 tiles, Y=L tiles, Z=batch. Each thread computes
TILE_M×TILE_N4×4 output elements. Inner loop loads input tiles and packed
weight tiles, then calls fp_accumulate_with_fp_weight for tiled FMA.

Supports both buffer and texture3d storage for input/output, texture2d or
buffer for packed weights, fp32/fp16, and optional bias. Registered as
et_vk.conv1d_pw.default (standalone custom op for testing/benchmarking).

Performance on Adreno 750 (S24):
- [1,256,1024]x[512,256,1] texture f16: 908 GFLOP/s
- [1,512,2048]x[256,512,1] texture f16: 865 GFLOP/s
- [1,128,4096]x[128,128,1] texture f16: 781 GFLOP/s
- [1,256,1024]x[512,256,1] buffer f16: 491 GFLOP/s

Differential Revision: [D97344092](https://our.internmc.facebook.com/intern/diff/D97344092/)

[ghstack-poisoned]