While compiling the CUDA-aware version of abacus_2g on the Shuguang supercomputer, numerical errors were observed in the GPU tests. #7076

@LiYuqiii

Description

Describe the Testing Issue

The following issues were encountered during GPU testing of the CUDA-aware version of abacus_2g:
11_PW_GPU/001_PW_BPCG_GPU and 16_SDFT_GPU/005_PW_SDFT_MALL_BPCG_GPU: both terminate prematurely and report the following in the error log:
terminate called after throwing an instance of 'std::runtime_error'
what(): trtri: failed to invert matrix
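
For context, trtri is the LAPACK-style routine that inverts a triangular matrix, and it fails when a diagonal entry is exactly zero. The sketch below is a pure-Python illustration of that failure mode, not the ABACUS or GPU library code; the function name invert_upper_triangular and the sample matrices are hypothetical. Whether a zero diagonal is what actually triggers the error in these two tests has not been confirmed.

```python
def invert_upper_triangular(U):
    """Invert an upper-triangular matrix (list of rows) by back substitution.

    Raises RuntimeError when a diagonal entry is exactly zero, which is the
    condition under which a trtri-style routine reports a singular matrix.
    Illustrative only; not the routine ABACUS calls.
    """
    n = len(U)
    for i in range(n):
        if U[i][i] == 0.0:
            raise RuntimeError(f"trtri-like failure: zero diagonal at index {i}")

    inv = [[0.0] * n for _ in range(n)]
    for j in range(n):
        # Diagonal of the inverse is the reciprocal of the diagonal of U.
        inv[j][j] = 1.0 / U[j][j]
        # Fill column j upwards, using the entries already computed below.
        for i in range(j - 1, -1, -1):
            s = sum(U[i][k] * inv[k][j] for k in range(i + 1, j + 1))
            inv[i][j] = -s / U[i][i]
    return inv


# Hypothetical well-conditioned input: inversion succeeds.
good = [[2.0, 1.0], [0.0, 4.0]]
print(invert_upper_triangular(good))

# Hypothetical corrupted input (zeroed diagonal entry): inversion fails.
bad = [[2.0, 1.0], [0.0, 0.0]]
try:
    invert_upper_triangular(bad)
except RuntimeError as err:
    print(err)
```

If the triangular factor is built from data exchanged between devices, a corrupted or partially zeroed buffer could produce this kind of input, so dumping the matrix passed to trtri on each rank may help narrow the problem down.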

Image

16_SDFT_GPU/002_PW_SKG_MALL_GPU: the results differ significantly from the reference values:

Image

Reference values:

Image

Calculated values:

Image

Further investigation showed that the results are correct when running on a single DCU; the errors occur only when running on two DCUs. With CUDA-aware compilation disabled, the two-DCU results are also correct.
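
One way to narrow this down further is to dump the same intermediate array from each configuration and find the first element that deviates from the single-DCU run. The following is a minimal sketch of such a comparison; the helper name first_mismatch, the tolerances, and all numeric values are hypothetical placeholders, not data from the failing tests.

```python
import math


def first_mismatch(reference, calculated, rel_tol=1e-6, abs_tol=1e-8):
    """Return the index of the first element that differs beyond tolerance.

    Returns None when every element agrees. NaN never compares as close,
    so a NaN in either list is reported as a mismatch.
    """
    if len(reference) != len(calculated):
        raise ValueError("arrays have different lengths")
    for i, (r, c) in enumerate(zip(reference, calculated)):
        if not math.isclose(r, c, rel_tol=rel_tol, abs_tol=abs_tol):
            return i
    return None


# Placeholder values standing in for dumps from the three configurations.
single_dcu = [1.0, 2.0, 3.0]
runs = {
    "2 DCUs, CUDA-aware off": [1.0, 2.0, 3.0],
    "2 DCUs, CUDA-aware on": [1.0, 2.0, float("nan")],
}

for name, values in runs.items():
    idx = first_mismatch(single_dcu, values)
    status = "matches" if idx is None else f"first mismatch at index {idx}"
    print(f"{name}: {status}")
```

Applying the same check to arrays captured immediately before and after each inter-device communication step would indicate whether the data is already wrong before the transfer or becomes wrong during it.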

Additional Context

No response

Task list for Issue attackers (only for developers)

  • Understand the testing issue described by the developer.
  • Review the specific test case, expected and actual results, and any error messages.
  • Identify the root cause of the test failure or issue.
  • If a possible solution is suggested, evaluate its feasibility and effectiveness.
  • Implement a fix for the test failure or issue, or create a new test case if needed.
  • Verify that the fix resolves the testing issue and the test case passes.
  • Review and update any relevant documentation, such as test plans or user guides.
  • Ensure the testing issue is resolved and close the ticket.
  • Share any lessons learned or best practices with the team to prevent similar issues in the future.

Labels: Bugs, GPU & DCU & HPC
