Prerequisites
Feature Description
Hardware-agnostic FlashAttention 2 implementation. No compilation required; works on any GPU.
Motivation
Is it possible to implement? The current flash attention implementation in the SD.cpp/GGML engine degrades performance terribly on AMD (Vulkan), as discussed here: leejet/stable-diffusion.cpp#1186
https://github.com/AuleTechnologies/Aule-Attention
Aule-Attention provides a drop-in FlashAttention implementation that works across all major GPU vendors without requiring compilation at install time. It automatically selects the optimal backend for your hardware:
Triton: For AMD ROCm and NVIDIA CUDA (training and inference)
Vulkan: For Intel, Apple, AMD consumer GPUs, and any Vulkan-capable device (inference)
CPU: NumPy fallback for systems without GPU support (a reference sketch of the underlying computation follows below).
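For context, the computation all of these backends implement is the tiled, online-softmax attention of FlashAttention 2. The snippet below is only an illustrative NumPy sketch of that algorithm in the spirit of the CPU fallback; it is not Aule-Attention's actual code, and the function name, block size, and shapes are assumptions made for the example.

```python
# Illustrative sketch only (not Aule-Attention's API): tiled attention with an
# online softmax, the core of FlashAttention 2. q, k, v: (seq_len, head_dim).
import numpy as np

def flash_attention_ref(q, k, v, block_size=64):
    seq_len, head_dim = q.shape
    scale = 1.0 / np.sqrt(head_dim)
    out = np.zeros((seq_len, head_dim), dtype=np.float32)
    for i in range(0, seq_len, block_size):
        qi = q[i:i + block_size].astype(np.float32) * scale
        # Running statistics for the online softmax over key/value blocks.
        m = np.full(qi.shape[0], -np.inf, dtype=np.float32)   # running row max
        l = np.zeros(qi.shape[0], dtype=np.float32)           # running row sum
        acc = np.zeros((qi.shape[0], head_dim), dtype=np.float32)
        for j in range(0, seq_len, block_size):
            kj = k[j:j + block_size].astype(np.float32)
            vj = v[j:j + block_size].astype(np.float32)
            s = qi @ kj.T                                  # score block
            m_new = np.maximum(m, s.max(axis=-1))
            p = np.exp(s - m_new[:, None])                 # stabilized probabilities
            correction = np.exp(m - m_new)                 # rescale earlier partials
            l = l * correction + p.sum(axis=-1)
            acc = acc * correction[:, None] + p @ vj
            m = m_new
        out[i:i + block_size] = acc / l[:, None]
    return out
```

The result should match a naive `softmax(q @ k.T / sqrt(d)) @ v` up to floating-point error, while never materializing the full attention matrix, which is what makes the approach attractive on memory-bandwidth-limited consumer GPUs.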
Possible Implementation
No response