- Add infini_train_add_test CMake macro for simplified test registration
- Integrate gtest_discover_tests for automatic test case discovery
- Refactor all test directories to use the unified macro (autograd, optimizer, hook, slow, lora)
- Reduce test CMakeLists.txt code by 68%
- Add LoRA tests (12 test cases)
- Delete TEST_REPORT.md
- Test labels: cpu/cuda/distributed/slow for flexible test execution
- Add shared test_macros.cmake in tests/common/

BREAKING CHANGE: Test registration now uses a macro instead of manual add_test()

Co-authored-by: Sisyphus <clio-agent@sisyphuslabs.ai>
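The unified macro might look roughly like the sketch below. This is an assumption about its shape, not the actual contents of tests/common/test_macros.cmake; in particular, the `infini_train` link target name is hypothetical.

```cmake
# Hypothetical sketch of the unified registration macro; the real
# infini_train_add_test may take extra arguments (e.g. labels, deps).
include(GoogleTest)

macro(infini_train_add_test TARGET)
    # Remaining arguments are the test sources.
    add_executable(${TARGET} ${ARGN})
    target_link_libraries(${TARGET} PRIVATE GTest::gtest_main infini_train)
    # Replaces manual add_test(): every TEST()/TEST_F() case is discovered
    # automatically, and labels let `ctest -L cpu` (or cuda/distributed/slow)
    # select subsets of the suite.
    gtest_discover_tests(${TARGET} PROPERTIES LABELS "cpu")
endmacro()

# Usage: one line per test binary instead of a block of add_test() calls.
infini_train_add_test(test_autograd test_autograd.cc)
```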
…d signed change

- Group results into improvements / regressions / normal sections
- Only regressions cause exit code 1; improvements print but pass
- Show signed percentage (+/-) instead of absolute error

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
… paths

Cast backward gradients to fp32 for bf16 compute in matmul, linear, and outer ops to preserve accumulation precision. Add vectorized no-broadcast fast paths for elementwise forward/backward kernels, skip the unnecessary Fill(0) when cuBLAS beta=0 fully overwrites the output, and cast saved tensors to the forward compute dtype in SetupContext.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…ous guards

- Add FIXME in Linear::SetupContext and Matmul::SetupContext noting that an extra cast is performed because autocast runs before autograd; compute_dtype should come from autocast, not from the output tensor dtype.
- Add IsContiguous() to the Tensor class and guard both fast paths in elementwise.cu (forward and backward) so non-contiguous tensors fall back to the broadcast path until proper stride tracking is added.
- Replace the silent dtype cast in AccumulateGrad with a WARNING log; the grad is now used as-is when a dtype mismatch is detected.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Print baseline/test paths at the start of output and update argument help text. In compare_tps, flip signed_change to (test-baseline)/baseline so positive means test is faster and negative means regression.
…all cudaMallocAsync
…memory

Add needs_input_grad_ tracking in the autograd Function to skip unnecessary gradient allocation and computation for frozen (requires_grad=false) parameters. For LoRA fine-tuning, this avoids allocating grad_weight tensors for all frozen base-model weights, reducing peak GPU memory from ~10.7 GB to ~7.7 GB. Also consolidate LinearBackward's loose parameters into LinearMeta and LinearGradFlags structs for clarity.
Simplifies the Linear autograd function by removing the intermediate LinearMeta struct and passing parameters directly to kernel implementations.
…needed

Previously, saved_tensors_ was set twice: first with cast tensors for both input and weight, then immediately overwritten by the needs_input_grad-conditional version without casting. As a result, saved tensors were never cast to compute_dtype, causing dtype mismatches in backward.
Replace std::random_device with 42 + omp_get_thread_num() to ensure reproducible LoRA initialization across runs.
Replace TEST_F with TEST_P across all test suites so each suite runs on both CPU and CUDA without duplicating test logic.

- Adds InfiniTrainTestP, TensorTestBaseP, AutogradTestBaseP, and DistributedInfiniTrainTestP base classes with automatic CUDA/NCCL skip guards.
- Introduces INFINI_TRAIN_REGISTER_TEST* C++ macros and an infini_train_add_test_suite CMake macro to eliminate repetitive INSTANTIATE_TEST_SUITE_P / infini_train_add_test boilerplate.
- Removes the deprecated test/, slow/, and split optimizer test files; consolidates optimizer tests into a single binary with creation + step suites.
No description provided.