feat(openqa-llm-investigation): improvements #537
Draft
okurz wants to merge 8 commits into os-autoinst:master
okurz (Member) commented on Apr 11, 2026
- feat: refine LLM bisection criteria in prompt
- feat: rename INVESTIGATE: YES/NO to BISECT: YES/NO
- fix: skip LLM investigation for softfailed jobs
- fix: broaden similar failures search in LLM investigation
- fix: correctly handle existing comments and fix unit tests
- fix: make openqa-llm-investigate idempotent
- feat: Integrate openqa-llm-investigate into investigate hook
- feat: Add openqa-llm-investigate and integrate into auto-review hook
Motivation: We want to integrate LLM analysis into the openQA investigation workflow to provide concise summaries of test failures. This helps reviewers quickly understand if an issue is a new product regression, a test regression, or an infrastructure problem, without scheduling potentially costly openqa-investigate jobs prematurely.

Design Choices:
- Created a standalone Python script `openqa-llm-investigate` using `typer` and `httpx`.
- The script acts as a gatekeeper in `openqa-label-known-issues-and-investigate-hook`: it parses the LLM's response and only outputs the job URL to trigger further bisections if the LLM determines it is necessary.
- Fetches job details, test results, and test history from the openQA API to build a comprehensive prompt.
- Used `pytest` and `unittest.mock` for unit testing the Python script.
- Updated the existing Bash test suite to mock and assert the execution of `openqa-llm-investigate`.

Benefits:
- Reduces the number of unnecessary and costly `openqa-investigate` jobs by filtering them through an LLM first.
- Provides immediate, actionable summaries of failures directly as openQA comments.
- Improves efficiency of test reviewers by providing context directly in the job page.

Related issue: os-autoinst/os-autoinst#2857
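To illustrate the prompt-building step described above, here is a minimal sketch, not the script's actual code. The function name `build_prompt`, the dictionary fields (`name`, `result`, `build`), and the prompt wording are all assumptions made for illustration; the data is expected to have been fetched from the openQA API beforehand.

```python
from typing import Any


def build_prompt(
    job: dict[str, Any],
    failed_modules: list[str],
    history: list[dict[str, Any]],
) -> str:
    """Assemble a plain-text prompt from already-fetched openQA data.

    All field names used here are hypothetical placeholders.
    """
    lines = [
        f"Job: {job.get('name', 'unknown')} (result: {job.get('result', 'unknown')})",
        # Fall back to a fixed phrase when no failed module is known.
        "Failed modules: " + (", ".join(failed_modules) or "none reported"),
        "Recent history:",
    ]
    for entry in history:
        lines.append(f"- build {entry.get('build', '?')}: {entry.get('result', '?')}")
    lines.append(
        "Summarize the likely cause: product regression, "
        "test regression, or infrastructure problem."
    )
    return "\n".join(lines)
```

Keeping the prompt builder free of network calls means it can be unit-tested with plain dictionaries, without mocking HTTP requests.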
Motivation: Ensure that costly openqa-investigate and bisection jobs are only triggered after an LLM has confirmed the necessity.

Design Choices:
- Modified 'investigate-and-bisect' in 'openqa-label-known-issues-and-investigate-hook' to call 'openqa-llm-investigate'.
- The bash function now captures the output of the LLM script. If no URL is returned (meaning the LLM decided against investigation), the process terminates early.
- Updated the test suite to include mocks for the LLM script and verified both the 'YES' and 'NO' investigation paths.

Benefits:
- Prevents redundant resource consumption by filtering investigation candidates through an intelligent gatekeeper.
- Provides consistent behavior with the existing 'label' mechanism.
Motivation: Prevent redundant LLM analysis and duplicate comments on the same job. Also ensures that the investigation hook doesn't trigger downstream actions (like openqa-investigate) if the job has already been analyzed.

Design Choices:
- Added a check for existing comments starting with the LLM investigation summary header.
- If found, the script exits early without outputting the job URL and without calling the LLM API.

Benefits:
- Saves LLM API costs.
- Prevents cluttering openQA jobs with duplicate comments.
- Ensures hook idempotency.
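The idempotency check above could look roughly like the following sketch. The header text `SUMMARY_HEADER`, the function name `already_investigated`, and the assumption that each comment is a dictionary with a `text` field are illustrative and not taken from the actual script.

```python
from typing import Any

# Hypothetical header; the real summary header used by the script may differ.
SUMMARY_HEADER = "LLM investigation summary"


def already_investigated(
    comments: list[dict[str, Any]], header: str = SUMMARY_HEADER
) -> bool:
    """Return True if any existing comment starts with the summary header."""
    return any(
        str(comment.get("text", "")).lstrip().startswith(header)
        for comment in comments
    )
```

A caller would run this check before contacting the LLM API and exit early, printing nothing, when it returns `True`.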
Motivation: Ensure that idempotency logic correctly parses comments and that unit tests provide accurate coverage for different API endpoints.

Design Choices:
- Updated 'fetch_json' to return a list on error if the URL indicates a comments API call.
- Updated all test mocks to correctly distinguish between job details and comments API calls.
- Verified passing unit tests.

Benefits:
- Robust idempotency implementation.
- Reliable test suite.
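A minimal sketch of the type-aware error fallback described above follows. It assumes that comments URLs end in `/comments` and that non-comments endpoints fall back to an empty dictionary; neither detail is confirmed by the commit message. The HTTP call is injected as a `get` callable (in the real script this would presumably wrap `httpx`) so that the sketch stays self-contained.

```python
from typing import Any, Callable


def empty_result_for(url: str) -> Any:
    """Return an empty value matching the shape the endpoint normally yields."""
    # Assumption: comments endpoints return a JSON list, everything else an object.
    return [] if url.rstrip("/").endswith("/comments") else {}


def fetch_json(url: str, get: Callable[[str], Any]) -> Any:
    """Fetch parsed JSON via `get`, degrading to an empty result on any error."""
    try:
        return get(url)
    except Exception:  # broad on purpose: any failure means "no data"
        return empty_result_for(url)
```

Returning a list for the comments endpoint lets the caller iterate over the result unconditionally, even after a failed request.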
Motivation: The LLM investigation was too narrow as it only searched for failures with the exact same test name. This led to incorrect 'only instance' conclusions when the same issue affected multiple different tests in the same build.

Design Choices:
- Updated 'build_search_query' to search for all failed jobs in the same build, regardless of the test name.
- Enhanced 'test_search_query' to provide more historical context for the specific test (including arch and flavor).
- Included distri, version, arch, and flavor in the prompt to give the LLM more environmental context.
- Updated unit tests to support the enriched job settings and new search logic.

Benefits:
- More accurate LLM assessments by providing broader failure context.
- Prevents false 'only instance' reports when an issue is widespread across a build.
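The two searches described above might be expressed as query-parameter dictionaries, as in the sketch below. The upper-case settings keys (`DISTRI`, `VERSION`, `BUILD`, `ARCH`, `FLAVOR`, `TEST`), the parameter names, and the function name `query_for_test_history` are assumptions; the actual 'build_search_query' and 'test_search_query' may be shaped differently.

```python
def build_search_query(settings: dict[str, str]) -> dict[str, str]:
    """Query for all failed jobs in the same build, regardless of test name."""
    return {
        "distri": settings["DISTRI"],
        "version": settings["VERSION"],
        "build": settings["BUILD"],
        "result": "failed",
    }


def query_for_test_history(settings: dict[str, str]) -> dict[str, str]:
    """Query for the history of one test, narrowed by arch and flavor."""
    return {
        "distri": settings["DISTRI"],
        "version": settings["VERSION"],
        "arch": settings["ARCH"],
        "flavor": settings["FLAVOR"],
        "test": settings["TEST"],
    }
```

The build-wide query deliberately omits the test name, which is what allows it to surface a failure that affects several different tests in one build.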
Motivation: Jobs with a result of 'softfailed' are generally considered acceptable and do not require the same level of rigorous investigation as outright failures.

Design Choices:
- Updated the result check in 'openqa-llm-investigate' to include both 'passed' and 'softfailed'.
- Enhanced the logging message to report the specific result that triggered the skip.
- Added a new unit test case to verify correct handling of 'softfailed' jobs.

Benefits:
- Reduces unnecessary LLM API calls and avoids redundant analysis of non-critical issues.
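The result check above can be sketched as follows. The names `SKIPPED_RESULTS` and `should_skip`, the log wording, and the assumption that the job dictionary carries a `result` field are illustrative only.

```python
import logging
from typing import Any

# Results that do not warrant an LLM investigation.
SKIPPED_RESULTS = frozenset({"passed", "softfailed"})


def should_skip(job: dict[str, Any]) -> bool:
    """Return True for jobs whose result needs no investigation."""
    result = job.get("result")
    if result in SKIPPED_RESULTS:
        # Report the specific result that triggered the skip.
        logging.info("Skipping LLM investigation: job result is '%s'", result)
        return True
    return False
```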
Motivation: The LLM performs the 'investigation' (analysis) and decides whether an automated 'bisection' should be triggered. Renaming the decision string to 'BISECT: YES/NO' clarifies the resulting action and distinguishes it from the analysis phase.

Design Choices:
- Updated the LLM prompt instruction to use 'BISECT: YES/NO'.
- Updated the output gatekeeper logic to check for 'BISECT: YES'.
- Updated unit test mocks and assertions to match the new strings.

Benefits:
- Better descriptive terminology for users reading job comments.
- Clearer architectural distinction between analysis and execution.
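The gatekeeper check above could be sketched like this. The function name `gatekeeper_output` is hypothetical, and matching the verdict only when it stands on its own line is an assumption of this sketch; the actual script may use a simple substring test instead.

```python
import re

# Match "BISECT: YES" only when it occupies a whole line of the response.
BISECT_YES = re.compile(r"^\s*BISECT:\s*YES\s*$", re.MULTILINE)


def gatekeeper_output(job_url: str, llm_response: str) -> str:
    """Return the job URL if the LLM asked for a bisection, else an empty string."""
    return job_url if BISECT_YES.search(llm_response) else ""
```

The calling hook can then treat empty output as "stop here" and any non-empty output as the URL to pass on to the bisection step.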
Motivation: The LLM was too aggressive in recommending bisections for widespread or long-standing issues. We need to explicitly guide it to only recommend bisection for newly introduced regressions where it provides the most value.

Design Choices:
- Updated the prompt to define bisection as a costly operation.
- Added explicit instructions to recommend 'BISECT: NO' for infrastructure issues, widespread build failures, or long-standing historical failures.

Benefits:
- More judicious use of worker resources by avoiding redundant bisections.
- Higher quality automated analysis.