I've added fixes for tests enabled with training. Although the tests are for training, a lot of the fixes are actually in common code.
c7d2c7f to 3aa7f73
@tianleiwu @amarin16 @baijumeswani, could you please take a look?
Pull request overview
This PR addresses big-endian (s390x) correctness issues in ORT by standardizing raw tensor data handling (writing/reading in little-endian form where required) and adjusting affected tests/build wiring so the test suite passes without training enabled.
Changes:
- Replace direct `TensorProto::set_raw_data(...)` usage with `onnxruntime::utils::SetRawDataInTensorProto(...)` across multiple components/tests to centralize endianness handling.
- Improve test robustness on big-endian by unpacking tensor proto data via ORT utilities (e.g., `UnpackTensor`, `ConvertRawDataInTensorProto`) instead of `memcpy`/reinterpretation.
- Update ORT-format/flatbuffer initializer handling and unit-test build configuration to support big-endian scenarios and non-training builds.
Reviewed changes
Copilot reviewed 18 out of 18 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| winml/adapter/winml_adapter_model.cpp | Use ORT helper to set TensorProto raw data with endian-awareness. |
| orttraining/orttraining/training_api/checkpoint.cc | Remove little-endian-only guard for training checkpoints. |
| orttraining/orttraining/test/graph/optimizer_graph_builder_test.cc | Read raw_data via UnpackTensor to be endian-correct. |
| orttraining/orttraining/core/optimizer/shape_optimizer.cc | Use endian-aware raw data setter for constant initializers. |
| orttraining/orttraining/core/optimizer/megatron_transformer.cc | Use endian-aware raw data setter for partitioned initializers. |
| orttraining/orttraining/core/optimizer/conv1d_replacement.cc | Use endian-aware raw data setter for initializer creation. |
| orttraining/orttraining/core/framework/checkpointing.cc | Remove big-endian “not implemented” restriction in checkpoint saving path. |
| onnxruntime/test/providers/nv_tensorrt_rtx/test_nv_trt_rtx_ep_util.cc | Use endian-aware raw data setter in test model building utilities. |
| onnxruntime/test/framework/sparse_kernels_test.cc | Convert/check raw_data in a big-endian-safe way (copy-by-value + conversion). |
| onnxruntime/test/framework/int2_test.cc | Use endian-aware raw data setter in Int2 round-trip test. |
| onnxruntime/test/framework/endian_test.cc | Use endian-aware raw data setter and add stronger assertions about conversion effects. |
| onnxruntime/test/flatbuffers/flatbuffer_utils_test.cc | Remove manual conversion now handled elsewhere. |
| onnxruntime/core/optimizer/qdq_transformer/where_dummy_dq.cc | Use endian-aware raw data setter for dummy initializer scalars. |
| onnxruntime/core/graph/graph_flatbuffers_utils.cc | Add big-endian handling for saving/loading ORT-format initializers and tensor dims. |
| onnxruntime/core/graph/graph.cc | Remove prior sparse-constant endian workaround; use endian-aware raw data setter for editor API. |
| onnxruntime/core/framework/tensorprotoutils.cc | Enhance raw-data setter and conversion logic; ensure sparse raw bytes are little-endian. |
| onnxruntime/core/framework/data_transfer_utils.h | Add byte-swapping to CopyTensorDataToByteSpan on big-endian. |
| cmake/onnxruntime_unittests.cmake | Ensure endian_utils is linked into tests when training is disabled. |
```cpp
if constexpr (endian::native != endian::little) {
  gsl::span<std::byte> bytes_span(reinterpret_cast<std::byte*>(dst_span.data()), dst_span.size_bytes());
  onnxruntime::utils::SwapByteOrderInplace(src_tensor.DataType()->Size(), bytes_span);
}
```
CopyTensorDataToByteSpan now unconditionally byte-swaps the destination buffer on big-endian hosts. This helper is used outside of checkpoint/serialization codepaths (e.g., copying device tensors into a NumPy array in the Python bindings), where callers expect a pure byte-for-byte copy in native format. Please avoid changing CopyTensorDataToByteSpan semantics globally; instead move the endian conversion to the specific serialization/checkpointing call sites that require little-endian on-disk representation, or add an explicit opt-in parameter/helper for "copy as little-endian".
Looks like a correct catch, I'll rework this part.
```cpp
auto be_copy{initializer};
onnxruntime::utils::ConvertRawDataInTensorProto(be_copy);
// If data is external, then call to ConvertRawDataInTensorProto will not properly byteswap it.
// A conversion using TensorProtoWithExternalDataToTensorProto does it.
// Do it after call to ConvertRawDataInTensorProto to ensure all data is byteswapped.
if (onnxruntime::utils::HasExternalData(be_copy)) {
  auto be_copy_external_data{be_copy};
  ORT_RETURN_IF_ERROR(onnxruntime::utils::TensorProtoWithExternalDataToTensorProto(be_copy_external_data, {}, be_copy));
}
ORT_RETURN_IF_ERROR(onnxruntime::utils::UnpackInitializerData(be_copy, model_path, unpacked_tensor));
```
This external-data handling in `SaveInitializerOrtFormat` has two issues on big-endian hosts:
- It passes `{}` into `TensorProtoWithExternalDataToTensorProto`, but that value is forwarded as the `tensor_proto_dir` for `ReadExternalDataForTensor`, so external data may be resolved relative to the current working directory instead of the model directory.
- After loading external bytes into `be_copy.raw_data`, the later `UnpackInitializerData(be_copy, ...)` call will invoke `ReadLittleEndian` and convert the bytes to native endianness before writing them into the flatbuffer, but ORT format `raw_data` is expected to be little-endian (`LoadOrtTensorOrtFormat` unpacks it with `ReadLittleEndian`).

Consider reading external bytes directly into `unpacked_tensor` (no `UnpackTensor` conversion) and using the correct directory derived from `model_path` when resolving the external data.
Suggested change:

```diff
- auto be_copy{initializer};
- onnxruntime::utils::ConvertRawDataInTensorProto(be_copy);
- // If data is external, then call to ConvertRawDataInTensorProto will not properly byteswap it.
- // A conversion using TensorProtoWithExternalDataToTensorProto does it.
- // Do it after call to ConvertRawDataInTensorProto to ensure all data is byteswapped.
- if (onnxruntime::utils::HasExternalData(be_copy)) {
-   auto be_copy_external_data{be_copy};
-   ORT_RETURN_IF_ERROR(onnxruntime::utils::TensorProtoWithExternalDataToTensorProto(be_copy_external_data, {}, be_copy));
- }
- ORT_RETURN_IF_ERROR(onnxruntime::utils::UnpackInitializerData(be_copy, model_path, unpacked_tensor));
+ // On big-endian hosts we must be careful to preserve the little-endian layout
+ // required by ORT format raw_data.
+ if (onnxruntime::utils::HasExternalData(initializer)) {
+   // Resolve external data relative to the model directory and read the bytes
+   // directly into unpacked_tensor without any endian conversion.
+   auto tensor_with_external{initializer};
+   auto tensor_with_data{initializer};
+   const auto model_dir = model_path.parent_path().string();
+   ORT_RETURN_IF_ERROR(
+       onnxruntime::utils::TensorProtoWithExternalDataToTensorProto(tensor_with_external, model_dir, tensor_with_data));
+   unpacked_tensor.assign(tensor_with_data.raw_data().begin(), tensor_with_data.raw_data().end());
+ } else {
+   auto be_copy{initializer};
+   onnxruntime::utils::ConvertRawDataInTensorProto(be_copy);
+   ORT_RETURN_IF_ERROR(onnxruntime::utils::UnpackInitializerData(be_copy, model_path, unpacked_tensor));
+ }
```
This part is reworked now
Thanks for the review, I'll rework that change. It is likely that byteswapping is excessive in that place but missing in some other place.
3aa7f73 to 067dd42
Build command: `./build.sh --config Debug --parallel 0 --enable_pybind --build_wheel --allow_running_as_root`
Later this data is narrowed: `*p_data++ = static_cast<T>(*data_iter);`

If, for example, BE `int32_t` data is 4 bytes `0x00 0x00 0x00 0x01`, then after byteswapping it becomes `0x01 0x00 0x00 0x00`, and after narrowing to `int16_t` the two rightmost bytes are used on big endian, so the result is `0x00 0x00`. If instead we byteswap it as two shorts, the byteswapping result is `0x00 0x00 0x01 0x00` and the narrowing result is `0x01 0x00`, which is the correct LE representation of that number.

This change fixes the following test on s390x: FlatbufferUtilsTest.ExternalWriteReadWithLoadInitializers
Raw data is expected to be in LE. This change fixes tests:
SparseTensorConversionTests.SparseTensorProtoToDense_Rank1Indices64
SparseTensorConversionTests.SparseTensorProtoToDense_Rank1Indices32
SparseTensorConversionTests.SparseTensorProtoToDense_Rank1Indices16
SparseTensorConversionTests.SparseTensorProtoToDense_Rank1Indices8
SparseTensorConversionTests.SparseTensorProtoToDense_Rank2Indices_COO
This change fixes tests:
OrtModelOnlyTests.SparseInitializerHandling
SparseTensorConversionTests.TestConstantNodeConversion

This change fixes the following tests on s390x:
ExecutionFrameTestInit.SparseInitializerAsOutput
CApiTest.SparseOutputModel
This change will allow assessing and fixing big-endian-specific issues in training-related code.
This change fixes approximately 40 tests.
This change fixes test CheckpointingTest.SaveAndLoad on s390x.
…ronTransformer class

This change fixes the following tests on s390x:
GraphTransformationTests.MegatronMLPPartitionRank0
GraphTransformationTests.MegatronMLPPartitionRank1
GraphTransformationTests.MegatronSelfAttentionPartitionRank0
GraphTransformationTests.MegatronSelfAttentionPartitionRank1
…roto

This should fix a lot of potential endianness issues on s390x.
This change fixes the following tests on s390x:
OptimizerGraphBuilderTest.LoadOptimState_FullPrecision_Adam
OptimizerGraphBuilderTest.LoadOptimState_FullPrecision_Lamb
Memory data is in native endian format, while on-disk data should already be in little-endian format. Move part of the ConvertRawDataInTensorProto function out into a separate one for convenience. This change fixes the test OrtModelOnlyTests.ValidateOrtFormatModelDoesNotRunOptimizersInFullBuild on s390x.
This change fixes the following tests on s390x:
SaveWithExternalInitializers.Mnist
SaveWithExternalInitializers.ModelWithOriginalExternalData
SaveWithExternalInitializers.ModelWithOriginalExternalDataAlignOffset
…unction

ReadExternalDataForTensor already returns data in native endian format. Also remove the ConvertEndianessForVector function: it does a const_cast and unexpectedly modifies the original data. When needed, use WriteLittleEndian instead. These changes fix the following tests on s390x:
TensorProtoUtilsTest.UnpackTensorWithExternalData
TensorProtoUtilsTest.ConstantTensorProtoWithExternalData
…onstantNodeConversion test
067dd42 to b8042be
Your finding is entirely correct if external data is in a file. However, in-memory external data is actually in native endian format, i.e. big endian on big-endian systems (see onnxruntime/onnxruntime/core/framework/tensorprotoutils.cc, lines 1803 to 1815 at 65fb61b). It seems like a bad idea to byteswap this memory data in advance, so I've added byteswapping of data read from file after reading it. It also revealed a couple of additional byteswapping issues, which I also investigated and fixed.
tianleiwu left a comment
GetElementSizeOfTensor Extraction (tensorprotoutils.cc)

⚠️ Missing STRING type returns 0, propagated to ReadLittleEndian without guard: `GetElementSizeOfTensor` returns 0 for unknown/unsupported types (e.g., STRING). Most callers guard with `if (element_size > 1)`, but in `ReadExternalDataForTensor`, on the `kTensorProtoLittleEndianMemoryAddressTag` path, `ReadLittleEndian(element_size, src, dst)` is called unconditionally. If `element_size` is 0, `CopyLittleEndian` receives 0 for `element_size_in_bytes`, which violates its documented precondition ("element_size_in_bytes should be greater than zero"). In practice this path is unlikely for STRING tensors (strings don't use raw external data), but a defensive check would be prudent:

```cpp
size_t element_size = GetElementSizeOfTensor(...);
ORT_RETURN_IF(element_size == 0, "Unsupported data type for endian conversion");
```
GetExtDataFromTensorProto — WriteLittleEndian Misuse (tensorprotoutils.cc)

⚠️ `WriteLittleEndian` used where `ReadLittleEndian` is semantically correct: In the `kTensorProtoLittleEndianMemoryAddressTag` branch of `GetExtDataFromTensorProto`, the source buffer (`ext_data_buf`) is in little-endian format and the destination (`native_data`) should be in native endian format. This is a "read from LE" operation, so `ReadLittleEndian` should be used. The code instead calls `WriteLittleEndian(element_size, src_span, dst_span)`, which is documented as "writes to a little-endian destination." Both functions are functionally identical (byte swap is self-inverse), so there is no runtime bug, but the naming is misleading and inconsistent with the DML provider code, which correctly uses `ReadLittleEndian` in the same scenario.

```cpp
// Current (misleading):
ORT_RETURN_IF_ERROR(onnxruntime::utils::WriteLittleEndian(element_size, src_span, dst_span));
// Should be:
ORT_RETURN_IF_ERROR(onnxruntime::utils::ReadLittleEndian(element_size, src_span, dst_span));
```
Save Path Byte-Swapping (graph.cc, graph_flatbuffers_utils.cc)

⚠️ Duplicated byte-swap boilerplate in `SaveOrtTensorOrtFormat`: The inline-data and external-data-writer paths in `SaveOrtTensorOrtFormat` both contain nearly identical byte-swap logic (allocate vector, compute element_size, WriteLittleEndian). Consider extracting it into a local lambda or helper to reduce duplication. This is a readability concern, not a correctness issue.
Summary of Concerns
| # | Severity | Component | Issue |
|---|---|---|---|
| 1 | Suggestion | `GetElementSizeOfTensor` | Returns 0 for unknown types; no guard before `ReadLittleEndian` in the `kTensorProtoLittleEndianMemoryAddressTag` path of `ReadExternalDataForTensor`. |
| 2 | Suggestion | `GetExtDataFromTensorProto` | `WriteLittleEndian` used where `ReadLittleEndian` is semantically correct (LE source → native dest). Functionally identical but misleading. |
| 3 | Nitpick | `SaveOrtTensorOrtFormat` | Duplicated byte-swap boilerplate between inline and external data writer paths could be extracted. |
Verdict
APPROVE — The PR correctly addresses a systemic class of big-endian bugs across the entire data serialization stack. The core architectural decisions (tag split, centralized SetRawDataInTensorProto conversion, ReadExternalDataForTensor returning native endian) are sound. The ConvertRawDataInTensorProto fix for sub-word types stored in wider protobuf fields is critical and correct. The two suggestion-level concerns (missing guard for element_size=0 and WriteLittleEndian/ReadLittleEndian naming mismatch) are worth fixing but are not blockers — the former is unreachable in practice and the latter is a semantic-only issue with no runtime impact.
e2becb5 to 7e1162b
7e1162b to 38983f1
/azp run Linux QNN CI Pipeline, Win_TRT_Minimal_CUDA_Test_CI, Windows ARM64 QNN CI Pipeline, Windows GPU Doc Gen CI Pipeline

Azure Pipelines successfully started running 4 pipeline(s).
Head branch was pushed to by a user without write access
I've added a change to include endian headers in a couple of files, due to a failing pipeline indicating this issue in one of the build configurations.
Edit: failing pipeline: https://github.com/microsoft/onnxruntime/actions/runs/23539560419
/azp run Linux QNN CI Pipeline, Win_TRT_Minimal_CUDA_Test_CI, Windows ARM64 QNN CI Pipeline, Windows GPU Doc Gen CI Pipeline

Azure Pipelines successfully started running 4 pipeline(s).
@AlekseiNikiforovIBM, there are still build errors in DirectML builds (please also check the other failed builds):
/azp run Linux QNN CI Pipeline, Win_TRT_Minimal_CUDA_Test_CI, Windows ARM64 QNN CI Pipeline, Windows GPU Doc Gen CI Pipeline

Azure Pipelines successfully started running 4 pipeline(s).
Head branch was pushed to by a user without write access
/azp run Linux QNN CI Pipeline, Win_TRT_Minimal_CUDA_Test_CI, Windows ARM64 QNN CI Pipeline, Windows GPU Doc Gen CI Pipeline

Azure Pipelines successfully started running 4 pipeline(s).
Description
This PR contains fixes to various big endian support issues in onnxruntime, both in libraries and tests.
Motivation and Context
Currently, some tests from the onnxruntime test suite fail on big-endian systems.
This change fixes all tests from the onnxruntime test suite when it is built without training support.
It also includes a fix for a linking issue.
The following tests are fixed on s390x:
OrtModelOnlyTests.ValidateOrtFormatModelDoesNotRunOptimizersInFullBuild
FlatbufferUtilsTest.ExternalWriteReadWithLoadInitializers
SparseTensorConversionTests.SparseTensorProtoToDense_Rank1Indices64
SparseTensorConversionTests.SparseTensorProtoToDense_Rank1Indices32
SparseTensorConversionTests.SparseTensorProtoToDense_Rank1Indices16
SparseTensorConversionTests.SparseTensorProtoToDense_Rank1Indices8
SparseTensorConversionTests.SparseTensorProtoToDense_Rank2Indices_COO
SparseTensorConversionTests.TestConstantNodeConversion
OrtModelOnlyTests.SparseInitializerHandling
SparseTensorConversionTests.TestDenseToSparseConversion
ExecutionFrameTestInit.SparseInitializerAsOutput
CApiTest.SparseOutputModel