feat(openqa-llm-investigation): improvements by okurz · Pull Request #537 · os-autoinst/os-autoinst-scripts

okurz · 2026-04-11T18:04:17Z

feat: refine LLM bisection criteria in prompt
feat: rename INVESTIGATE: YES/NO to BISECT: YES/NO
fix: skip LLM investigation for softfailed jobs
fix: broaden similar failures search in LLM investigation
fix: correctly handle existing comments and fix unit tests
fix: make openqa-llm-investigate idempotent
feat: Integrate openqa-llm-investigate into investigate hook
feat: Add openqa-llm-investigate and integrate into auto-review hook

Motivation: We want to integrate LLM analysis into the openQA investigation workflow to provide concise summaries of test failures. This helps reviewers quickly understand if an issue is a new product regression, a test regression, or an infrastructure problem, without scheduling potentially costly openqa-investigate jobs prematurely.

Design Choices:
- Created a standalone Python script `openqa-llm-investigate` using `typer` and `httpx`.
- The script acts as a gatekeeper in `openqa-label-known-issues-and-investigate-hook`: it parses the LLM's response and only outputs the job URL to trigger further bisections if the LLM determines it is necessary.
- Fetches job details, test results, and test history from the openQA API to build a comprehensive prompt.
- Used `pytest` and `unittest.mock` for unit testing the Python script.
- Updated the existing Bash test suite to mock and assert the execution of `openqa-llm-investigate`.

Benefits:
- Reduces the number of unnecessary and costly `openqa-investigate` jobs by filtering them through an LLM first.
- Provides immediate, actionable summaries of failures directly as openQA comments.
- Improves efficiency of test reviewers by providing context directly in the job page.

Related issue: os-autoinst/os-autoinst#2857

Motivation: Ensure that costly openqa-investigate and bisection jobs are only triggered after an LLM has confirmed the necessity.

Design Choices:
- Modified 'investigate-and-bisect' in 'openqa-label-known-issues-and-investigate-hook' to call 'openqa-llm-investigate'.
- The bash function now captures the output of the LLM script. If no URL is returned (meaning the LLM decided against investigation), the process terminates early.
- Updated the test suite to include mocks for the LLM script and verified both the 'YES' and 'NO' investigation paths.

Benefits:
- Prevents redundant resource consumption by filtering investigation candidates through an intelligent gatekeeper.
- Provides consistent behavior with the existing 'label' mechanism.

Motivation: Prevent redundant LLM analysis and duplicate comments on the same job. Also ensures that the investigation hook doesn't trigger downstream actions (like openqa-investigate) if the job has already been analyzed.

Design Choices:
- Added a check for existing comments starting with the LLM investigation summary header.
- If found, the script exits early without outputting the job URL and without calling the LLM API.

Benefits:
- Saves LLM API costs.
- Prevents cluttering openQA jobs with duplicate comments.
- Ensures hook idempotency.
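For reviewers, the idempotency check described here could look roughly like the following sketch. It is not the code from this PR: the header string `SUMMARY_HEADER`, the function name `already_investigated`, and the assumption that each comment is a dict with a `text` field are all illustrative.

```python
from typing import Any, Dict, List

# Hypothetical header string; the real one used by the script is not shown here.
SUMMARY_HEADER = "LLM investigation summary"


def already_investigated(comments: List[Dict[str, Any]], header: str = SUMMARY_HEADER) -> bool:
    """Return True if any existing comment starts with the summary header.

    Assumption: each comment is a dict carrying its body under the "text" key.
    """
    return any(
        str(comment.get("text", "")).lstrip().startswith(header)
        for comment in comments
    )
```

A caller would return early, printing nothing and making no LLM request, whenever this check is true.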

Motivation: Ensure that idempotency logic correctly parses comments and that unit tests provide accurate coverage for different API endpoints.

Design Choices:
- Updated 'fetch_json' to return a list on error if the URL indicates a comments API call.
- Updated all test mocks to correctly distinguish between job details and comments API calls.
- Verified passing unit tests.

Benefits:
- Robust idempotency implementation.
- Reliable test suite.
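The shape-aware fallback can be sketched as below. This is a simplified illustration under assumptions: the real 'fetch_json' signature is not shown in this description, the injected `get` callable stands in for the actual HTTP call so the example needs no network, and detecting a comments call by a path ending in `/comments` is a guess. The helper `empty_result_for` is hypothetical.

```python
import sys
from typing import Any, Callable


def empty_result_for(url: str) -> Any:
    """Pick an empty value matching the shape the endpoint normally returns."""
    # Assumption: comment endpoints end in "/comments" and return a JSON list,
    # while other endpoints (e.g. job details) return a JSON object.
    path = url.split("?", 1)[0].rstrip("/")
    return [] if path.endswith("/comments") else {}


def fetch_json(url: str, get: Callable[[str], Any]) -> Any:
    """Fetch JSON via `get`; on any error return an empty value of the expected shape."""
    try:
        return get(url)
    except Exception as exc:  # broad on purpose: any failure degrades to "no data"
        print(f"Failed to fetch {url}: {exc}", file=sys.stderr)
        return empty_result_for(url)
```

Returning a list for comment URLs lets code that iterates over comments run unchanged after a failed request.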

Motivation: The LLM investigation was too narrow as it only searched for failures with the exact same test name. This led to incorrect 'only instance' conclusions when the same issue affected multiple different tests in the same build.

Design Choices:
- Updated 'build_search_query' to search for all failed jobs in the same build, regardless of the test name.
- Enhanced 'test_search_query' to provide more historical context for the specific test (including arch and flavor).
- Included distri, version, arch, and flavor in the prompt to give the LLM more environmental context.
- Updated unit tests to support the enriched job settings and new search logic.

Benefits:
- More accurate LLM assessments by providing broader failure context.
- Prevents false 'only instance' reports when an issue is widespread across a build.
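To illustrate the difference between the two searches, here is a rough sketch. The function signatures, the helper name `history_query`, the `limit` default, and the query parameter names are assumptions for illustration; they are not confirmed by this description and should be checked against the openQA jobs API and the actual script.

```python
from typing import Dict


def build_search_query(settings: Dict[str, str]) -> Dict[str, str]:
    """Build-wide search: all failed jobs in the same build, any test name."""
    return {
        "distri": settings["DISTRI"],
        "version": settings["VERSION"],
        "build": settings["BUILD"],
        "result": "failed",
    }


def history_query(settings: Dict[str, str], limit: int = 10) -> Dict[str, str]:
    """Per-test history: same test, arch and flavor across builds (hypothetical helper)."""
    return {
        "distri": settings["DISTRI"],
        "version": settings["VERSION"],
        "test": settings["TEST"],
        "arch": settings["ARCH"],
        "flavor": settings["FLAVOR"],
        "limit": str(limit),
    }
```

The first query deliberately omits the test name so that a failure hitting many tests in one build is visible; the second omits the build so that the history of one scenario is visible.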

Motivation: Jobs with a result of 'softfailed' are generally considered acceptable and do not require the same level of rigorous investigation as outright failures.

Design Choices:
- Updated the result check in 'openqa-llm-investigate' to include both 'passed' and 'softfailed'.
- Enhanced the logging message to report the specific result that triggered the skip.
- Added a new unit test case to verify correct handling of 'softfailed' jobs.

Benefits:
- Reduces unnecessary LLM API calls and avoids redundant analysis of non-critical issues.
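A minimal sketch of such a result check follows. The names `SKIP_RESULTS` and `skip_reason`, the message wording, and the assumption that the job dict exposes its outcome under `result` are illustrative, not taken from the script.

```python
from typing import Any, Dict, Optional

# Results that do not warrant an LLM investigation
SKIP_RESULTS = frozenset({"passed", "softfailed"})


def skip_reason(job: Dict[str, Any]) -> Optional[str]:
    """Return a log message naming the result if the job should be skipped, else None."""
    result = job.get("result")
    if result in SKIP_RESULTS:
        return f"Job result is '{result}', skipping LLM investigation"
    return None
```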

Motivation: The LLM performs the 'investigation' (analysis) and decides whether an automated 'bisection' should be triggered. Renaming the decision string to 'BISECT: YES/NO' clarifies the resulting action and distinguishes it from the analysis phase.

Design Choices:
- Updated the LLM prompt instruction to use 'BISECT: YES/NO'.
- Updated the output gatekeeper logic to check for 'BISECT: YES'.
- Updated unit test mocks and assertions to match the new strings.

Benefits:
- Better descriptive terminology for users reading job comments.
- Clearer architectural distinction between analysis and execution.
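The gatekeeper step can be pictured with the sketch below. It is one possible way to check for the marker, not the implementation in this PR: the function names are hypothetical, and the choices to require the marker on its own line and to let the last marker win are assumptions made here to keep the check conservative.

```python
import re
from typing import Optional

# Match a decision marker that occupies a whole line, e.g. "BISECT: YES"
_DECISION = re.compile(r"^\s*BISECT:\s*(YES|NO)\s*$", re.MULTILINE)


def wants_bisection(llm_response: str) -> bool:
    """True only if the last decision marker in the response is 'BISECT: YES'."""
    decisions = _DECISION.findall(llm_response)
    # No marker at all is treated as "NO" so that a malformed reply never
    # triggers costly follow-up jobs.
    return bool(decisions) and decisions[-1] == "YES"


def gatekeeper_output(job_url: str, llm_response: str) -> Optional[str]:
    """Return the job URL to hand to the hook, or None to stop the pipeline."""
    return job_url if wants_bisection(llm_response) else None
```

With this shape, the calling hook only needs to test whether any output was produced.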

Motivation: The LLM was too aggressive in recommending bisections for widespread or long-standing issues. We need to explicitly guide it to only recommend bisection for newly introduced regressions where it provides the most value.

Design Choices:
- Updated the prompt to define bisection as a costly operation.
- Added explicit instructions to recommend 'BISECT: NO' for infrastructure issues, widespread build failures, or long-standing historical failures.

Benefits:
- More judicious use of worker resources by avoiding redundant bisections.
- Higher quality automated analysis.

okurz added 8 commits April 11, 2026 15:57

okurz wants to merge 8 commits into os-autoinst:master from okurz:feature/003_llm_investigation_2
