test: fix flaky key-order assertions in structured output tests by BrendanWalsh · Pull Request #2537 · microsoft/SynapseML

BrendanWalsh · 2026-03-31T16:37:28Z

Problem

The 4 tests in test_ResponseFormatOrder.py assert that JSON keys in the model response appear in a specific order matching the schema property order. However, OpenAI does not guarantee that response keys will follow the schema ordering, making these tests non-deterministic.

CI failure (ADO build 213040246, test run 837042988):

AssertionError: False is not true : Expected reason before ans.
Output: {"ans":"Paris, New York","reason":"You asked for two cities; I provided Paris and New York as examples."}

Fix

Replaced text.find() key-order assertions with JSON parsing + key presence checks
Tests now validate the model response is valid JSON with the expected keys (reason and ans) and correct types
Extracted a shared _assert_valid_response helper with a docstring explaining the rationale
Removed unused os import

Why this is correct

The library feature (preserving schema property ordering in the API request via LinkedHashMap) is already thoroughly tested by the Scala unit tests in ResponseFormatOrderSuite.scala, which verify serialization ordering without hitting the API. The Python integration tests should verify that structured output works (returns valid JSON matching the schema), not that the model happens to return keys in a particular order.

github-actions · 2026-03-31T16:40:13Z

Hey @BrendanWalsh 👋!
Thank you so much for contributing to our repository 🙌.
Someone from SynapseML Team will be reviewing this pull request soon.

We use semantic commit messages to streamline the release process.
Before your pull request can be merged, you should make sure your first commit and PR title start with a semantic prefix.
This helps us to create release messages and credit you for your hard work!

Examples of commit messages with semantic prefixes:

fix: Fix LightGBM crashes with empty partitions
feat: Make HTTP on Spark back-offs configurable
docs: Update Spark Serving usage
build: Add codecov support
perf: improve LightGBM memory usage
refactor: make python code generation rely on classes
style: Remove nulls from CNTKModel
test: Add test coverage for CNTKModel

To test your commit locally, please follow our guild on building from source.
Check out the developer guide for additional guidance on testing your change.

Copilot

Pull request overview

This PR updates the Python OpenAI integration tests to avoid flaky assertions about JSON key ordering in model responses, shifting the tests to validate structured output correctness (valid JSON + expected keys/types) instead of response property order.

Changes:

Replaced response key-order assertions (text.find(...)) with JSON parsing and key/type validation.
Added a shared _assert_valid_response helper documenting why key-order is not asserted.
Removed an unused os import.

cognitive/src/test/python/synapsemltest/services/openai/test_ResponseFormatOrder.py

Remove non-deterministic assertions on JSON key ordering in model responses. LLMs do not guarantee response key order matches schema property order. Tests now validate structured output correctness (valid JSON with expected keys and types) instead. Rename module/class/methods to reflect the new intent: - test_ResponseFormatOrder.py -> test_StructuredOutput.py - TestResponseFormatOrder -> TestStructuredOutput - test_*_reason_then_ans -> test_*_structured_output_reason_first - test_*_ans_then_reason -> test_*_structured_output_ans_first Extract shared _make_prompt helper to reduce test boilerplate. Schema ordering preservation is already covered by the Scala ResponseFormatOrderSuite unit tests.

BrendanWalsh · 2026-04-01T04:02:53Z

/azp run

azure-pipelines · 2026-04-01T04:03:05Z

Azure Pipelines successfully started running 1 pipeline(s).

codecov-commenter · 2026-04-01T04:21:19Z

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 84.61%. Comparing base (1152f50) to head (391ac2c).

Additional details and impacted files

@@           Coverage Diff           @@
##           master    #2537   +/-   ##
=======================================
  Coverage   84.61%   84.61%           
=======================================
  Files         335      335           
  Lines       17708    17708           
  Branches     1612     1612           
=======================================
  Hits        14984    14984           
  Misses       2724     2724

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:

❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

Copilot AI review requested due to automatic review settings March 31, 2026 16:37

Copilot started reviewing on behalf of BrendanWalsh March 31, 2026 16:42 View session

Copilot AI reviewed Mar 31, 2026

View reviewed changes

cognitive/src/test/python/synapsemltest/services/openai/test_ResponseFormatOrder.py Outdated Show resolved Hide resolved

BrendanWalsh force-pushed the fix/flaky-json-order-test branch 2 times, most recently from 6aa723d to a0aa012 Compare March 31, 2026 23:37

BrendanWalsh changed the title ~~fix: remove flaky key-order assertions from response format tests~~ test: fix flaky key-order assertions in structured output tests Mar 31, 2026

BrendanWalsh force-pushed the fix/flaky-json-order-test branch from a0aa012 to 391ac2c Compare March 31, 2026 23:53

BrendanWalsh merged commit 76bd8e3 into microsoft:master Apr 1, 2026
70 of 74 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

test: fix flaky key-order assertions in structured output tests#2537

test: fix flaky key-order assertions in structured output tests#2537
BrendanWalsh merged 1 commit intomicrosoft:masterfrom
BrendanWalsh:fix/flaky-json-order-test

BrendanWalsh commented Mar 31, 2026

Uh oh!

github-actions bot commented Mar 31, 2026

Uh oh!

Copilot AI left a comment

Uh oh!

Uh oh!

BrendanWalsh commented Apr 1, 2026

Uh oh!

azure-pipelines bot commented Apr 1, 2026

Uh oh!

codecov-commenter commented Apr 1, 2026 •

edited

Loading

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

BrendanWalsh commented Mar 31, 2026

Problem

Fix

Why this is correct

Uh oh!

github-actions bot commented Mar 31, 2026

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Uh oh!

Uh oh!

BrendanWalsh commented Apr 1, 2026

Uh oh!

azure-pipelines bot commented Apr 1, 2026

Uh oh!

codecov-commenter commented Apr 1, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

codecov-commenter commented Apr 1, 2026 •

edited

Loading