[SPARK-56089][SQL] Align asinh/acosh with fdlibm algorithm for cross-engine compatibility#54912
[SPARK-56089][SQL] Align asinh/acosh with fdlibm algorithm for cross-engine compatibility#54912xiaoxuandev wants to merge 1 commit intoapache:masterfrom
Conversation
| struct<ASINH(1):double> | ||
| -- !query output | ||
| 0.8813735870195429 | ||
| 0.881373587019543 |
There was a problem hiding this comment.
Checked with PostgreSQL.
| } else if (ax > 2.0) { | ||
| StrictMath.log(2.0 * ax + 1.0 / (math.sqrt(x * x + 1.0) + ax)) | ||
| } else { | ||
| StrictMath.log1p(ax + x * x / (1.0 + math.sqrt(1.0 + x * x))) |
There was a problem hiding this comment.
? here in fdmlib t = x * x
There was a problem hiding this comment.
Thank you both, fixed!
peter-toth
left a comment
There was a problem hiding this comment.
Looks good to me, I have only a nit.
mgaido91
left a comment
There was a problem hiding this comment.
just a couple of question and one suggestion
| // fdlibm s_asinh.c algorithm | ||
| val ax = Math.abs(x) | ||
| val w = if (ax.isInfinite || ax.isNaN) { | ||
| ax |
There was a problem hiding this comment.
in fbmlib they have x + x here IIUC. I do not see the rationale though.
There was a problem hiding this comment.
In fdlibm, x + x for the inf/NaN case serves two purposes: (1) it propagates signaling NaN signals in C, and (2) it preserves the sign of infinity (since asinh is an odd function).
In our implementation, we take Math.abs(x) which strips the sign, but Math.copySign(w, x) at the end restores it correctly for both +Inf and -Inf. For NaN, Math.copySign(NaN, x) still returns NaN. The signaling NaN distinction doesn't apply in the JVM, so the result is equivalent, just a different way to achieve the same thing.
| } else if (ax > 2.0) { | ||
| StrictMath.log(2.0 * ax + 1.0 / (math.sqrt(x * x + 1.0) + ax)) | ||
| } else { | ||
| StrictMath.log1p(ax + x * x / (1.0 + math.sqrt(1.0 + x * x))) |
There was a problem hiding this comment.
? here in fdmlib t = x * x
| |} else if (ax > 2.0) { | ||
| | w = $sm.log(2.0 * ax + 1.0 / (java.lang.Math.sqrt($c * $c + 1.0) + ax)); | ||
| |} else { | ||
| | w = $sm.log1p(ax + $c * $c / (1.0 + java.lang.Math.sqrt(1.0 + $c * $c))); |
There was a problem hiding this comment.
| | w = $sm.log1p(ax + $c * $c / (1.0 + java.lang.Math.sqrt(1.0 + $c * $c))); | |
| | double t = $c * $c; | |
| | w = $sm.log1p(ax + t / (1.0 + java.lang.Math.sqrt(1.0 + t))); |
| extends UnaryMathExpression((x: Double) => { | ||
| // fdlibm e_acosh.c algorithm | ||
| if (x < 1.0) { | ||
| Double.NaN |
There was a problem hiding this comment.
again, as a question, do you know why fdmlib here has (x - x) / (x - x)?
There was a problem hiding this comment.
(x - x) / (x - x) is an fdlibm idiom to produce NaN while also raising the IEEE 754 "invalid operation" exception signal. In C, code can detect this via fetestexcept(FE_INVALID). The JVM has no IEEE 754 exception flag mechanism, (x - x) / (x - x) and Double.NaN are functionally identical in Java/Scala. So we use Double.NaN here for clarity.
…engine compatibility ### What changes were proposed in this pull request? Replace the naive `log(x + sqrt(x^2 ± 1))` formula in `Asinh` and `Acosh` with the full fdlibm algorithm (`s_asinh.c` / `e_acosh.c`), as implemented in OpenJDK `FdLibm.java`. For `Acosh` (5 branches): - `x < 1`: NaN - `x >= 2^28`: `log(x) + log(2)` - `x == 1`: `0.0` - `2 < x < 2^28`: `log(2x - 1/(x + sqrt(x*x - 1)))` - `1 < x < 2`: `log1p(t + sqrt(2*t + t*t))` For `Asinh` (4 branches): - `|x| < 2^-28`: `x` (identity) - `|x| > 2^28`: `sign(x) * (log(|x|) + log(2))` - `|x| > 2`: `sign(x) * log(2|x| + 1/(|x| + sqrt(x*x + 1)))` - otherwise: `sign(x) * log1p(|x| + x*x/(1 + sqrt(1 + x*x)))` ### Why are the changes needed? The fdlibm algorithm (`s_asinh.c` / `e_acosh.c`) is the de facto standard used by OpenJDK, glibc, musl, Go, Python, PostgreSQL, and DuckDB. Spark's previous naive formula `log(x + sqrt(x^2 ± 1))` produces results that differ by 1-2 ULP from these platforms for certain input ranges, causing cross-engine inconsistencies. The fdlibm algorithm also avoids catastrophic cancellation near x=1 (acosh) and x=0 (asinh) by using range-specific formulas (log1p, identity). This also changes the large-value threshold from `sqrt(Double.MaxValue)` to `2^28`, matching the fdlibm standard. ### Does this PR introduce _any_ user-facing change? Results may differ by 1-2 ULP for inputs in the `(1, 2]` range (acosh) and `[-2, 2]` range (asinh) due to the more precise formulas. The new results are closer to the mathematically exact values and consistent with other platforms. ### How was this patch tested? Updated tests in `MathExpressionsSuite` with fdlibm-aligned reference functions, plus new tests covering all branches with hardcoded reference values cross-verified against C libm. ### Was this patch authored or co-authored using generative AI tooling? Yes, co-authored with Kiro.
What changes were proposed in this pull request?
Replace the naive
log(x + sqrt(x^2 ± 1))formula inAsinhandAcoshwith the full fdlibm algorithm (s_asinh.c/e_acosh.c), as implemented in OpenJDKFdLibm.java.For
Acosh(5 branches):x < 1: NaNx >= 2^28:log(x) + log(2)x == 1:0.02 < x < 2^28:log(2x - 1/(x + sqrt(x*x - 1)))1 < x < 2:log1p(t + sqrt(2*t + t*t))For
Asinh(4 branches):|x| < 2^-28:x(identity)|x| > 2^28:sign(x) * (log(|x|) + log(2))|x| > 2:sign(x) * log(2|x| + 1/(|x| + sqrt(x*x + 1)))sign(x) * log1p(|x| + x*x/(1 + sqrt(1 + x*x)))Why are the changes needed?
The fdlibm algorithm (
s_asinh.c/e_acosh.c) is the de facto standard used by OpenJDK, glibc, musl, Go, Python, PostgreSQL, and DuckDB. Spark's previous naive formulalog(x + sqrt(x^2 ± 1))produces results that differ by 1-2 ULP from these platforms for certain input ranges, causing cross-engine inconsistencies.The fdlibm algorithm also avoids catastrophic cancellation near x=1 (acosh) and x=0 (asinh) by using range-specific formulas (log1p, identity).
This also changes the large-value threshold from
sqrt(Double.MaxValue)to2^28, matching the fdlibm standard.Does this PR introduce any user-facing change?
Results may differ by 1-2 ULP for inputs in the
(1, 2]range (acosh) and[-2, 2]range (asinh) due to the more precise formulas. The new results are closer to the mathematically exact values and consistent with other platforms.How was this patch tested?
Updated tests in
MathExpressionsSuitewith fdlibm-aligned reference functions, plus new tests covering all branches with hardcoded reference values cross-verified against C libm.Was this patch authored or co-authored using generative AI tooling?
Yes, co-authored with Kiro.