Call *rot to perform eigenvector update of *steqr #1120
sh-zheng wants to merge 1 commit into Reference-LAPACK:master
angsch
left a comment
I see this PR as a fit in an optimized library that lacks an optimized version of LASR. Reference LAPACK prioritizes algorithmic clarity and potential. From an algorithmic standpoint, LASR is better than ROT. Below is a breakdown of the arithmetic intensity to support this perspective. The analysis assumes real arithmetic (in complex arithmetic, the flops are higher, but the conclusion is the same).
ROT applies a single rotation to two adjacent columns.
- Load the 2 columns Z(1:n,l:l+1): 2 * n * sizeof(datatype)
- Compute Z(1:n,l:l+1) * P**T: 6 flops (4 multiplications, 2 additions) per row, so 6n flops in total
- Store the 2 columns: 2 * n * sizeof(datatype)
- arithmetic intensity = compute/data = 6n / (4n*sizeof(datatype)) ~ 1.5
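As a rough illustration of the count above, the following sketch applies one plane rotation to two adjacent columns and tallies flops and elements moved. The function name `rot_columns` and the list-of-rows layout are assumptions made for illustration, not the BLAS interface; the update formula follows the usual ROT convention. The ratio is flops per element moved, which is how the "~ 1.5" figure reads once sizeof(datatype) is dropped.

```python
import math

def rot_columns(Z, l, c, s):
    """Apply one plane rotation to columns l and l+1 of Z (a list of rows), in place.

    Returns (flops, elements_moved), counted as in the analysis above.
    Hypothetical helper, not a BLAS routine.
    """
    flops = 0
    moved = 0
    for row in Z:
        x, y = row[l], row[l + 1]   # load 2 elements
        row[l] = c * x + s * y      # 2 multiplications, 1 addition
        row[l + 1] = c * y - s * x  # 2 multiplications, 1 addition
        flops += 6
        moved += 4                  # 2 loads + 2 stores
    return flops, moved

theta = 0.3
Z = [[1.0, 2.0, 3.0] for _ in range(5)]
flops, moved = rot_columns(Z, 0, math.cos(theta), math.sin(theta))
print(flops / moved)  # 1.5
```

Each call streams both columns through memory once, so the ratio stays at 1.5 regardless of n.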
LASR can be implemented to achieve higher arithmetic intensity. Although reference LAPACK does not currently support this, it has been discussed for example in #710. For the approach in #710, the analysis is as follows.
- Load 3 columns Z(1:n,l:l+2): 3 * n * sizeof(datatype)
- Compute Z(1:n,l:l+2) * P1**T * P2**T: 12n flops
- Store 3 columns: 3 * n * sizeof(datatype)
- arithmetic intensity = compute/data = 12n / (6n*sizeof(datatype)) ~ 2
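The fused variant can be sketched as follows: two consecutive rotations are applied to three adjacent columns in one pass, so the shared middle column is loaded and stored only once. This is a minimal sketch under assumed names (`apply_rot`, `apply_two_rots_fused`) and is not the code discussed in #710; it only reproduces the flop and data-movement count.

```python
import math

def apply_rot(Z, l, c, s):
    """Reference: one rotation on columns l, l+1 (hypothetical helper)."""
    for row in Z:
        x, y = row[l], row[l + 1]
        row[l] = c * x + s * y
        row[l + 1] = c * y - s * x

def apply_two_rots_fused(Z, l, c1, s1, c2, s2):
    """Apply rotation 1 to columns (l, l+1), then rotation 2 to (l+1, l+2), in one pass.

    Returns (flops, elements_moved).
    """
    flops = 0
    moved = 0
    for row in Z:
        x, y, z = row[l], row[l + 1], row[l + 2]  # 3 loads
        t = c1 * y - s1 * x     # middle value after rotation 1, kept in a register
        x = c1 * x + s1 * y
        y = c2 * t + s2 * z     # rotation 2 acts on (t, z)
        z = c2 * z - s2 * t
        row[l], row[l + 1], row[l + 2] = x, y, z  # 3 stores
        flops += 12
        moved += 6
    return flops, moved

c1, s1 = math.cos(0.3), math.sin(0.3)
c2, s2 = math.cos(1.1), math.sin(1.1)
A = [[float(i + j) for j in range(3)] for i in range(4)]
B = [row[:] for row in A]

flops, moved = apply_two_rots_fused(A, 0, c1, s1, c2, s2)
apply_rot(B, 0, c1, s1)
apply_rot(B, 1, c2, s2)
print(flops / moved)  # 2.0
```

The fused pass produces the same matrix as two separate rotations while moving 6 elements per row instead of 8.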
In the limit, LASR can achieve:
- Load n-by-n matrix: $n^2$ * sizeof(datatype)
- Apply (n-1) Givens rotations to n-by-n matrix: $6n^2$ flops
- Store n-by-n matrix: $n^2$ * sizeof(datatype)
- arithmetic intensity = compute/data = $6n^2$ / ($2n^2$ * sizeof(datatype)) ~ 3
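To make the limit concrete, the two counts can be compared for a sweep of n-1 rotations over an n-by-n matrix. The helper names below are hypothetical, and the model assumes, as the analysis does, that the limiting case reads and writes the whole matrix exactly once. The exact flop count 6n(n-1) is used, which approaches $6n^2$ for large n.

```python
def intensity_rot_sweep(n):
    """Flops per element moved when each of the n-1 rotations is a separate ROT call."""
    flops = 6 * n * (n - 1)   # 6 flops per row, n rows, n-1 rotations
    moved = 4 * n * (n - 1)   # every call loads and stores 2 columns
    return flops / moved

def intensity_single_pass(n):
    """Flops per element moved when the matrix is loaded and stored exactly once."""
    flops = 6 * n * (n - 1)
    moved = 2 * n * n         # n^2 loads + n^2 stores
    return flops / moved

for n in (10, 100, 1000):
    print(n, intensity_rot_sweep(n), intensity_single_pass(n))
```

The separate-call count stays at 1.5 for every n, while the single-pass count equals 3(n-1)/n and tends to 3.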
This MR has initiated LASR3, where the arithmetic intensity is O(n). A more detailed analysis can be found in this paper.
In summary, LASR offers the potential to reduce data movement and thereby achieve higher arithmetic intensity. In contrast, ROT limits the vector update to an operation with low arithmetic intensity that cannot be improved. Even if ROT is currently faster (when linked against optimized BLAS), algorithmically, it is not the right direction for future development in reference LAPACK.
Thank you very much for the review, @angsch. I agree with your point. I did not consider algorithmic efficiency in this PR. My original intention was to unify the existing algorithms for symmetric and skew-symmetric matrices in #1049. It seems that a better approach would be to implement #1049 in the same way as steqr. I'll update that PR later. This PR will be closed.
Call *rot instead of *lasr to perform the eigenvector update of *steqr, so that optimized BLAS subroutines are fully utilized.
Code equivalence is verified with the norm $\Vert A - ZDZ^T \Vert_2$. Entries are generated randomly.
Single Precision Case
Double Precision Case
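The residual check described above can be sketched in a few lines. This is not the test harness used for this PR: the function name `residual_fro` is hypothetical, and the sketch computes the Frobenius norm, which bounds the 2-norm from above, because it needs no library support.

```python
import math

def residual_fro(A, Z, d):
    """Frobenius norm of A - Z * diag(d) * Z^T for square A, Z given as lists of rows."""
    n = len(A)
    total = 0.0
    for i in range(n):
        for j in range(n):
            zdzt = sum(Z[i][k] * d[k] * Z[j][k] for k in range(n))
            total += (A[i][j] - zdzt) ** 2
    return math.sqrt(total)

# Small symmetric example with a known decomposition: eigenvalues 1 and 3.
r = 1.0 / math.sqrt(2.0)
A = [[2.0, 1.0], [1.0, 2.0]]
Z = [[r, r], [-r, r]]   # columns are the eigenvectors
d = [1.0, 3.0]
print(residual_fro(A, Z, d) < 1e-12)  # True
```

A residual near machine precision for both the *lasr and *rot variants indicates that the two code paths produce equivalent decompositions.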
Performance is measured in milliseconds and shows an improvement. The platform is an Intel Xeon 1660 v3 @ 3GHz. As an additional test, we also measured the performance of subroutine *kteqr in PR #1049.
Single Precision Case
Double Precision Case