[cpu][x86] GEMM vectorization example#74

Open
adam-smnk wants to merge 9 commits into llvm:main from adam-smnk:cpu-gemm-schedule

Conversation

@adam-smnk
Member

Adds x86-specific vectorization example for matrix multiplication.
Comes with a collection of opinionated but reusable transforms and schedules.

The lowering schedule currently supports F32 (general) and BF16 (avx512, flat layout) matmuls.
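The schedule itself is not shown here, so as a rough orientation, the loop structure such a tiled-and-vectorized matmul lowering produces can be sketched in plain NumPy (tile sizes below are illustrative, not the schedule's actual parameters):

```python
import numpy as np

def tiled_matmul(A, B, tile_m=32, tile_n=32, tile_k=32):
    """Reference tiled matmul: the loop nest a vectorization schedule
    typically materializes. Each inner tile product is what would map
    onto vector (e.g. AVX-512) instructions after lowering."""
    M, K = A.shape
    K2, N = B.shape
    assert K == K2, "inner dimensions must match"
    C = np.zeros((M, N), dtype=A.dtype)
    for i in range(0, M, tile_m):
        for j in range(0, N, tile_n):
            for k in range(0, K, tile_k):
                # NumPy slicing clips at the boundary, handling ragged tiles.
                C[i:i + tile_m, j:j + tile_n] += (
                    A[i:i + tile_m, k:k + tile_k]
                    @ B[k:k + tile_k, j:j + tile_n]
                )
    return C
```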

@adam-smnk adam-smnk marked this pull request as draft March 13, 2026 15:48
@adam-smnk
Member Author

Needs changes from #65 related to using multiple schedules in a workload.

The added transform module aims to provide small reusable transform "bundles" to simplify writing schedules.
The schedules ended up mostly wrapping the transforms to hide the schedule creation boilerplate.
Finally, the matmul example creates a vectorization lowering using these building blocks plus a few problem-specific bits that I didn't feel were generic enough for reuse.

All these helpers are opinionated by design, modeled mostly on what the example needs. The APIs could probably be refined. Also, the schedules ended up being mostly simple wrappers around the transform bundles, so perhaps it's not worth having both modules.
Open to suggestions.

@adam-smnk
Member Author

Reworked transform module to provide simple APIs over transform ops.
Schedule module now takes care of op matching to provide simple reusable rewrites.

@adam-smnk adam-smnk marked this pull request as ready for review March 16, 2026 15:29
Member

@rengolin rengolin left a comment


The reason why I wanted to add a python file as a schedule was to be able to reuse all of those new schedules you created and added to the lighthouse scope. We can discuss that later.

Some comments inline.

if dtype == ml_dtypes.bfloat16:
# For BF16, enforce fixed tile size due to current rewriter pattern matching limitation.
# TODO: Relax when x86 BF16 pass supports dynamic indexing.
tile_size = 32
Member


perhaps a warning message (stderr?) saying you did this, to avoid surprises.
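The suggested stderr warning could look like the sketch below; `resolve_tile_size` and its arguments are hypothetical names for illustration, not the example's actual API:

```python
import sys

def resolve_tile_size(dtype_name, requested):
    """Force the fixed BF16 tile size and warn on stderr when the
    user's requested value is overridden (hypothetical helper)."""
    if dtype_name == "bfloat16" and requested != 32:
        print(
            f"warning: overriding tile_size={requested} with 32 for BF16 "
            "(current x86 BF16 rewriter requires a fixed tile size)",
            file=sys.stderr,
        )
        return 32
    return requested
```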

dump_payload=args.dump_kernel,
dump_schedule=args.dump_schedule,
)
else:
Member


no need for else here

Member Author


It's here so the script either prints or executes, not both.
An explicit sys.exit might be clearer.
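The early-return alternative under discussion can be sketched as follows; the argument and function names are illustrative stand-ins, not the example's real API:

```python
def run(args, execute=lambda a: "executed"):
    """Early-return style instead of if/else: dump-only mode leaves the
    function before execution, so no trailing `else` is needed.
    An explicit sys.exit(0) would serve the same purpose in a script."""
    if args.get("dump_only"):
        print("...schedule dump would go here...")
        return None
    # Falls through to execution without an else branch.
    return execute(args)
```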

Comment on lines +93 to +96
sched = lh_schedule.create_schedule()
named_seq = lh_schedule.create_named_sequence(
sched, input_types=[transform.any_op_t()]
)
Contributor


This pattern is captured in @schedule_boilerplate. We should either always use the decorator or get rid of it.
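The decorator-based factoring being suggested can be sketched generically; the dictionaries below are stand-ins for the real `create_schedule` / `create_named_sequence` results, and `schedule_boilerplate` here only mimics the shape of the decorator named in the comment:

```python
import functools

def schedule_boilerplate(fn):
    """Illustrative decorator: build the schedule module and named
    sequence once, then hand them to the wrapped rewrite body."""
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        sched = {"name": "schedule"}      # stand-in for create_schedule()
        seq = {"inputs": ["any_op"]}      # stand-in for create_named_sequence()
        return fn(sched, seq, *args, **kwargs)
    return wrapper

@schedule_boilerplate
def my_rewrite(sched, seq):
    # The body sees the pre-built schedule and sequence directly.
    return (sched["name"], seq["inputs"])
```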

Comment on lines +58 to +65
def linalg_to_category() -> ir.Module:
"""
Morph all linalg ops to category ops.

Returns:
Schedule
"""
return linalg_morph(generic_to_category=True, named_to_category=True)
Contributor


What's the benefit of such small wrappers, which are an alias for a single call? Do you expect they can grow?

Member Author


These act purely as aliases for readability: fixed functionality, instead of having to pass and scan arguments at every call site.

In this case, I anticipate that normalizing linalg to a specific abstraction would be desired and/or common enough to "justify" these. But they could be removed too.

@fschlimb
Contributor

The APIs could probably be refined. Also, schedules ended up being mostly simple wrappers around the transform bundles. Perhaps, it's not worth having both modules.

In my mind, we should not add abstractions/wrappers unless they really simplify something beyond saving a few characters. I find single-operation wrappers more confusing than helpful. Many of the schedules and transforms in this PR are of the one-operation kind, and I suggest removing them. Even things like one-op+cleanup don't seem to add anything worth a wrapper; even if the cleanup is mandatory (which ideally it should not be), the benefit is debatable.

@fschlimb
Contributor

Also, I would appreciate if we had a single way of expressing and using a compiler pipeline (whether they are composed of schedules or passes or whatnot). Can we use the new Stage stuff?

Personally, I prefer a simple list of things, which the evaluator dispatches in the right way. No need for extra abstractions like "PassStage" or "TransformsStage", Python has all the features to do that. As @adam-smnk also mentioned elsewhere, we should be adding abstraction only very sparingly.
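The "plain list, dispatched by type" idea can be sketched as below; the stage representations (callables for schedules, strings for pass names) and the evaluator are assumptions for illustration, not an existing lighthouse API:

```python
def run_pipeline(module, pipeline):
    """Evaluator sketch: the pipeline is an ordinary Python list and
    each item is dispatched by its type, avoiding PassStage/TransformsStage
    wrapper classes. `module` here is a plain list standing in for IR."""
    for stage in pipeline:
        if callable(stage):
            # A schedule: any Python callable taking and returning a module.
            module = stage(module)
        elif isinstance(stage, str):
            # A pass referenced by name; appending stands in for running it.
            module = module + [stage]
        else:
            raise TypeError(f"unsupported pipeline stage: {stage!r}")
    return module
```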

@adam-smnk
Member Author

One potential use case of the predefined schedules is to make them accessible directly via lh-opt or the upcoming YAML spec. These schedules essentially act as small bundles of transforms instead of passes.

The specific granularity is up for debate of course.

@adam-smnk
Member Author

Also, I would appreciate if we had a single way of expressing and using a compiler pipeline (whether they are composed of schedules or passes or whatnot). Can we use the new Stage stuff?

I guess that'd require further Workload refactoring to integrate the pipeline abstraction.
Overall, +1 and it probably needs to be done, but it's definitely out of scope here.

memory_writes = self.M * self.N * nbytes # write C
return (flop_count, memory_reads, memory_writes)

def payload_module(self) -> ir.Module:
Contributor


Although the payload is simple, we could reuse the mlir_gen utils here.
