[cpu][x86] GEMM vectorization example#74

Open
adam-smnk wants to merge 9 commits into llvm:main from adam-smnk:cpu-gemm-schedule

Conversation

@adam-smnk
Member

Adds x86-specific vectorization example for matrix multiplication.
Comes with a collection of opinionated but reusable transforms and schedules.

The lowering schedule currently supports F32 (general) and BF16 (avx512, flat layout) matmuls.
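The schedule itself is not shown here, so as a rough orientation, the loop structure such a tiled-and-vectorized matmul lowering produces can be sketched in plain NumPy (tile sizes below are illustrative, not the schedule's actual parameters):

```python
import numpy as np

def tiled_matmul(A, B, tile_m=32, tile_n=32, tile_k=32):
    """Reference tiled matmul: the loop nest a vectorization schedule
    typically materializes. Each inner tile product is what would map
    onto vector (e.g. AVX-512) instructions after lowering."""
    M, K = A.shape
    K2, N = B.shape
    assert K == K2, "inner dimensions must match"
    C = np.zeros((M, N), dtype=A.dtype)
    for i in range(0, M, tile_m):
        for j in range(0, N, tile_n):
            for k in range(0, K, tile_k):
                # NumPy slicing clips at the boundary, handling ragged tiles.
                C[i:i + tile_m, j:j + tile_n] += (
                    A[i:i + tile_m, k:k + tile_k]
                    @ B[k:k + tile_k, j:j + tile_n]
                )
    return C
```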

@adam-smnk adam-smnk marked this pull request as draft March 13, 2026 15:48
@adam-smnk
Member Author

Needs changes from #65 related to using multiple schedules in a workload.

The added transform module aims to provide small reusable transform "bundles" to simplify writing schedules.
The schedules ended up mostly wrapping the transforms to hide the schedule creation boilerplate.
Finally, the matmul example creates a vectorization lowering using these building blocks plus a few problem-specific bits that I didn't feel were generic enough for reuse.

All these helpers are opinionated by design, modeled mostly on what the example needs. The APIs could probably be refined. Also, the schedules ended up being mostly simple wrappers around the transform bundles, so perhaps it's not worth having both modules.
Open to suggestions.

@adam-smnk
Member Author

Reworked transform module to provide simple APIs over transform ops.
Schedule module now takes care of op matching to provide simple reusable rewrites.

@adam-smnk adam-smnk marked this pull request as ready for review March 16, 2026 15:29
Member

@rengolin rengolin left a comment


The reason why I wanted to add a python file as a schedule was to be able to reuse all of those new schedules you created and added to the lighthouse scope. We can discuss that later.

Some comments inline.

if dtype == ml_dtypes.bfloat16:
# For BF16, enforce fixed tile size due to current rewriter pattern matching limitation.
# TODO: Relax when x86 BF16 pass supports dynamic indexing.
tile_size = 32
Member


perhaps a warning message (stderr?) saying you did this, to avoid surprises.
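The suggested stderr warning could look like the sketch below; `resolve_tile_size` and its arguments are hypothetical names for illustration, not the example's actual API:

```python
import sys

def resolve_tile_size(dtype_name, requested):
    """Force the fixed BF16 tile size and warn on stderr when the
    user's requested value is overridden (hypothetical helper)."""
    if dtype_name == "bfloat16" and requested != 32:
        print(
            f"warning: overriding tile_size={requested} with 32 for BF16 "
            "(current x86 BF16 rewriter requires a fixed tile size)",
            file=sys.stderr,
        )
        return 32
    return requested
```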

dump_payload=args.dump_kernel,
dump_schedule=args.dump_schedule,
)
else:
Member


no need for else here

Member Author


It's here so the script either prints or executes, not both.
An explicit sys.exit might be clearer.
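The early-return alternative under discussion can be sketched as follows; the argument and function names are illustrative stand-ins, not the example's real API:

```python
def run(args, execute=lambda a: "executed"):
    """Early-return style instead of if/else: dump-only mode leaves the
    function before execution, so no trailing `else` is needed.
    An explicit sys.exit(0) would serve the same purpose in a script."""
    if args.get("dump_only"):
        print("...schedule dump would go here...")
        return None
    # Falls through to execution without an else branch.
    return execute(args)
```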

Comment on lines +93 to +96
sched = lh_schedule.create_schedule()
named_seq = lh_schedule.create_named_sequence(
sched, input_types=[transform.any_op_t()]
)
Contributor


This pattern is captured in @schedule_boilerplate. We should either always use the decorator or get rid of it.
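The decorator-based factoring being suggested can be sketched generically; the dictionaries below are stand-ins for the real `create_schedule` / `create_named_sequence` results, and `schedule_boilerplate` here only mimics the shape of the decorator named in the comment:

```python
import functools

def schedule_boilerplate(fn):
    """Illustrative decorator: build the schedule module and named
    sequence once, then hand them to the wrapped rewrite body."""
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        sched = {"name": "schedule"}      # stand-in for create_schedule()
        seq = {"inputs": ["any_op"]}      # stand-in for create_named_sequence()
        return fn(sched, seq, *args, **kwargs)
    return wrapper

@schedule_boilerplate
def my_rewrite(sched, seq):
    # The body sees the pre-built schedule and sequence directly.
    return (sched["name"], seq["inputs"])
```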

Comment on lines +58 to +65
def linalg_to_category() -> ir.Module:
"""
Morph all linalg ops to category ops.

Returns:
Schedule
"""
return linalg_morph(generic_to_category=True, named_to_category=True)
Contributor


What's the benefit of such small wrappers, which are an alias for a single call? Do you expect they can grow?

Member Author


These act purely as aliases for readability: fixed functionality, instead of having to pass and scan arguments at every call site.

In this case, I anticipate that normalizing linalg to a specific abstraction would be desired and/or common enough to "justify" these. But they could be removed too.

@fschlimb
Contributor

The APIs could probably be refined. Also, schedules ended up being mostly simple wrappers around the transform bundles. Perhaps, it's not worth having both modules.

In my mind, we should not add abstractions/wrappers unless they really simplify something beyond saving a few characters. I find single-operation wrappers more confusing than helpful. Many of the schedules and transforms in this PR are of the one-operation kind, and I suggest removing them. Even things like one-op+cleanup don't seem to add anything worth a wrapper; even if the cleanup is mandatory (which ideally it should not be), the benefit is debatable.

@fschlimb
Contributor

Also, I would appreciate if we had a single way of expressing and using a compiler pipeline (whether they are composed of schedules or passes or whatnot). Can we use the new Stage stuff?

Personally, I prefer a simple list of things, which the evaluator dispatches in the right way. No need for extra abstractions like "PassStage" or "TransformsStage", Python has all the features to do that. As @adam-smnk also mentioned elsewhere, we should be adding abstraction only very sparingly.
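The "plain list, dispatched by type" idea can be sketched as below; the stage representations (callables for schedules, strings for pass names) and the evaluator are assumptions for illustration, not an existing lighthouse API:

```python
def run_pipeline(module, pipeline):
    """Evaluator sketch: the pipeline is an ordinary Python list and
    each item is dispatched by its type, avoiding PassStage/TransformsStage
    wrapper classes. `module` here is a plain list standing in for IR."""
    for stage in pipeline:
        if callable(stage):
            # A schedule: any Python callable taking and returning a module.
            module = stage(module)
        elif isinstance(stage, str):
            # A pass referenced by name; appending stands in for running it.
            module = module + [stage]
        else:
            raise TypeError(f"unsupported pipeline stage: {stage!r}")
    return module
```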

@adam-smnk
Member Author

One potential use case of the predefined schedules is to make them accessible directly via lh-opt or the upcoming YAML spec. These schedules essentially act as small bundles of transforms instead of passes.

The specific granularity is up for debate of course.

@adam-smnk
Member Author

Also, I would appreciate if we had a single way of expressing and using a compiler pipeline (whether they are composed of schedules or passes or whatnot). Can we use the new Stage stuff?

I guess that'd require further Workload refactoring to integrate the pipeline abstraction.
Overall, +1 and it probably needs to be done, but it's definitely out of scope here.

memory_writes = self.M * self.N * nbytes # write C
return (flop_count, memory_reads, memory_writes)

def payload_module(self) -> ir.Module:
Contributor


Although the payload is simple, we could reuse the mlir_gen utils here.
