Skip to content

test: fix flaky key-order assertions in structured output tests#2537

Merged
BrendanWalsh merged 1 commit intomicrosoft:masterfrom
BrendanWalsh:fix/flaky-json-order-test
Apr 1, 2026
Merged

test: fix flaky key-order assertions in structured output tests#2537
BrendanWalsh merged 1 commit intomicrosoft:masterfrom
BrendanWalsh:fix/flaky-json-order-test

Conversation

@BrendanWalsh
Copy link
Copy Markdown
Collaborator

Problem

The 4 tests in test_ResponseFormatOrder.py assert that JSON keys in the model response appear in a specific order matching the schema property order. However, OpenAI does not guarantee that response keys will follow the schema ordering, making these tests non-deterministic.

CI failure (ADO build 213040246, test run 837042988):

AssertionError: False is not true : Expected reason before ans.
Output: {"ans":"Paris, New York","reason":"You asked for two cities; I provided Paris and New York as examples."}

Fix

  • Replaced text.find() key-order assertions with JSON parsing + key presence checks
  • Tests now validate the model response is valid JSON with the expected keys (reason and ans) and correct types
  • Extracted a shared _assert_valid_response helper with a docstring explaining the rationale
  • Removed unused os import

Why this is correct

The library feature (preserving schema property ordering in the API request via LinkedHashMap) is already thoroughly tested by the Scala unit tests in ResponseFormatOrderSuite.scala, which verify serialization ordering without hitting the API. The Python integration tests should verify that structured output works (returns valid JSON matching the schema), not that the model happens to return keys in a particular order.

Copilot AI review requested due to automatic review settings March 31, 2026 16:37
@github-actions
Copy link
Copy Markdown

Hey @BrendanWalsh 👋!
Thank you so much for contributing to our repository 🙌.
Someone from SynapseML Team will be reviewing this pull request soon.

We use semantic commit messages to streamline the release process.
Before your pull request can be merged, you should make sure your first commit and PR title start with a semantic prefix.
This helps us to create release messages and credit you for your hard work!

Examples of commit messages with semantic prefixes:

  • fix: Fix LightGBM crashes with empty partitions
  • feat: Make HTTP on Spark back-offs configurable
  • docs: Update Spark Serving usage
  • build: Add codecov support
  • perf: improve LightGBM memory usage
  • refactor: make python code generation rely on classes
  • style: Remove nulls from CNTKModel
  • test: Add test coverage for CNTKModel

To test your commit locally, please follow our guild on building from source.
Check out the developer guide for additional guidance on testing your change.

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR updates the Python OpenAI integration tests to avoid flaky assertions about JSON key ordering in model responses, shifting the tests to validate structured output correctness (valid JSON + expected keys/types) instead of response property order.

Changes:

  • Replaced response key-order assertions (text.find(...)) with JSON parsing and key/type validation.
  • Added a shared _assert_valid_response helper documenting why key-order is not asserted.
  • Removed an unused os import.

@BrendanWalsh BrendanWalsh force-pushed the fix/flaky-json-order-test branch 2 times, most recently from 6aa723d to a0aa012 Compare March 31, 2026 23:37
@BrendanWalsh BrendanWalsh changed the title fix: remove flaky key-order assertions from response format tests test: fix flaky key-order assertions in structured output tests Mar 31, 2026
Remove non-deterministic assertions on JSON key ordering in model
responses. LLMs do not guarantee response key order matches schema
property order. Tests now validate structured output correctness
(valid JSON with expected keys and types) instead.

Rename module/class/methods to reflect the new intent:
- test_ResponseFormatOrder.py -> test_StructuredOutput.py
- TestResponseFormatOrder -> TestStructuredOutput
- test_*_reason_then_ans -> test_*_structured_output_reason_first
- test_*_ans_then_reason -> test_*_structured_output_ans_first

Extract shared _make_prompt helper to reduce test boilerplate.
Schema ordering preservation is already covered by the Scala
ResponseFormatOrderSuite unit tests.
@BrendanWalsh BrendanWalsh force-pushed the fix/flaky-json-order-test branch from a0aa012 to 391ac2c Compare March 31, 2026 23:53
@BrendanWalsh
Copy link
Copy Markdown
Collaborator Author

/azp run

@azure-pipelines
Copy link
Copy Markdown

Azure Pipelines successfully started running 1 pipeline(s).

@codecov-commenter
Copy link
Copy Markdown

codecov-commenter commented Apr 1, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 84.61%. Comparing base (1152f50) to head (391ac2c).

Additional details and impacted files
@@           Coverage Diff           @@
##           master    #2537   +/-   ##
=======================================
  Coverage   84.61%   84.61%           
=======================================
  Files         335      335           
  Lines       17708    17708           
  Branches     1612     1612           
=======================================
  Hits        14984    14984           
  Misses       2724     2724           

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

@BrendanWalsh BrendanWalsh merged commit 76bd8e3 into microsoft:master Apr 1, 2026
70 of 74 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants