An FP8 E4M3FN arithmetic library for Go. The E4M3FN format is commonly used in quantized ML inference.
Part of the Zerfoo ML ecosystem.
- IEEE 754-style FP8 E4M3FN format — 1 sign, 4 exponent, 3 mantissa bits
- Fast lookup tables — optional pre-computed tables for arithmetic and conversion
- Full arithmetic — add, subtract, multiply, divide, sqrt, abs, neg
- No infinities — the E4M3FN variant uses the infinity encoding for additional finite values
- Zero dependencies — pure Go, no CGo
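To make the precision implied by 3 mantissa bits concrete, here is a minimal standalone sketch that rounds a value to the nearest number E4M3FN can represent. `quantizeE4M3FN` is a hypothetical helper written for illustration, not part of the `float8` package. It assumes round-to-nearest-even and saturation to ±448 on overflow; the library's own rounding and overflow behavior may differ.

```go
package main

import (
	"fmt"
	"math"
)

// quantizeE4M3FN rounds x to the nearest value representable in E4M3FN.
// Hypothetical helper for illustration; not part of the float8 package.
// Assumptions: round-to-nearest-even, and saturation to ±448 on overflow.
func quantizeE4M3FN(x float64) float64 {
	if math.IsNaN(x) || x == 0 {
		return x
	}
	a := math.Abs(x)
	_, exp := math.Frexp(a) // a = frac × 2^exp with frac in [0.5, 1)
	e := exp - 1            // so a = m × 2^e with m in [1, 2)
	if e < -6 {
		e = -6 // subnormal range: the exponent is pinned at -6
	}
	// 3 mantissa bits give 8 evenly spaced steps per power of two.
	step := math.Ldexp(1, e-3)
	q := math.RoundToEven(a/step) * step
	if q > 448 {
		q = 448 // largest finite E4M3FN magnitude
	}
	return math.Copysign(q, x)
}

func main() {
	for _, x := range []float64{3.14, 2.71, 3.25 * 2.75, 1000} {
		fmt.Printf("%v -> %v\n", x, quantizeE4M3FN(x))
	}
}
```

Under these assumptions, 3.14 rounds to 3.25 and 2.71 rounds to 2.75, because representable values in [2, 4) are spaced 0.25 apart. Values above 448 clamp to 448 rather than becoming infinity.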
```sh
go get github.com/zerfoo/float8
```

Requires Go 1.26+.
```go
package main

import (
	"fmt"

	"github.com/zerfoo/float8"
)

func main() {
	// Convert float32 inputs to FP8; each is rounded to fit 3 mantissa bits.
	a := float8.FromFloat32(3.14)
	b := float8.FromFloat32(2.71)

	// Arithmetic operates on FP8 values and returns FP8 results.
	sum := a.Add(b)
	product := a.Mul(b)

	fmt.Printf("a = %f\n", a.ToFloat32())
	fmt.Printf("a + b = %f\n", sum.ToFloat32())
	fmt.Printf("a * b = %f\n", product.ToFloat32())
}
```

| Field | Bits | Description |
|---|---|---|
| Sign | 1 | 0 = positive, 1 = negative |
| Exponent | 4 | Biased by 7, unbiased range [-6, 8] for normal values |
| Mantissa | 3 | 3 explicit + 1 implicit leading bit |
Special values: ±0 (exp=0000, mant=000) and NaN (exp=1111, mant=111). There are no infinities; the largest finite magnitude is 448 (exp=1111, mant=110).
```go
// Enable lookup tables for faster arithmetic (trades memory for speed)
float8.EnableFastArithmetic()
float8.EnableFastConversion()
```

- ztensor — GPU-accelerated tensor library
Apache 2.0