feat(openqa-llm-investigation): improvements #537

Draft
okurz wants to merge 8 commits into os-autoinst:master from okurz:feature/003_llm_investigation_2

Conversation

Member

@okurz okurz commented Apr 11, 2026

  • feat: refine LLM bisection criteria in prompt
  • feat: rename INVESTIGATE: YES/NO to BISECT: YES/NO
  • fix: skip LLM investigation for softfailed jobs
  • fix: broaden similar failures search in LLM investigation
  • fix: correctly handle existing comments and fix unit tests
  • fix: make openqa-llm-investigate idempotent
  • feat: Integrate openqa-llm-investigate into investigate hook
  • feat: Add openqa-llm-investigate and integrate into auto-review hook

okurz added 8 commits April 11, 2026 15:57

feat: Add openqa-llm-investigate and integrate into auto-review hook

Motivation:
We want to integrate LLM analysis into the openQA investigation workflow to
provide concise summaries of test failures. This helps reviewers quickly
understand if an issue is a new product regression, a test regression, or an
infrastructure problem, without scheduling potentially costly openqa-investigate
jobs prematurely.

Design Choices:
- Created a standalone Python script `openqa-llm-investigate` using `typer`
  and `httpx`.
- The script acts as a gatekeeper in
  `openqa-label-known-issues-and-investigate-hook`: it parses the LLM's response
  and only outputs the job URL to trigger further bisections if the LLM
  determines it is necessary.
- Fetches job details, test results, and test history from the openQA API to
  build a comprehensive prompt.
- Used `pytest` and `unittest.mock` for unit testing the Python script.
- Updated the existing Bash test suite to mock and assert the execution of
  `openqa-llm-investigate`.

Benefits:
- Reduces the number of unnecessary and costly `openqa-investigate` jobs by
  filtering them through an LLM first.
- Provides immediate, actionable summaries of failures directly as openQA
  comments.
- Improves efficiency of test reviewers by providing context directly in the job
  page.

Related issue: os-autoinst/os-autoinst#2857

feat: Integrate openqa-llm-investigate into investigate hook

Motivation:
Ensure that costly openqa-investigate and bisection jobs are only
triggered after an LLM has confirmed the necessity.

Design Choices:
- Modified 'investigate-and-bisect' in
  'openqa-label-known-issues-and-investigate-hook' to call
  'openqa-llm-investigate'.
- The Bash function now captures the output of the LLM script. If no
  URL is returned (meaning the LLM decided against investigation),
  the function returns early and nothing is scheduled.
- Updated the test suite to include mocks for the LLM script and
  verified both the 'YES' and 'NO' investigation paths.

Benefits:
- Prevents redundant resource consumption by filtering investigation
  candidates through an intelligent gatekeeper.
- Provides consistent behavior with the existing 'label' mechanism.

fix: make openqa-llm-investigate idempotent

Motivation:
Prevent redundant LLM analysis and duplicate comments on the same job.
This also ensures that the investigation hook doesn't trigger downstream
actions (like openqa-investigate) if the job has already been
analyzed.

Design Choices:
- Added a check for existing comments starting with the LLM
  investigation summary header.
- If found, the script exits early without outputting the job URL
  and without calling the LLM API.
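
As a rough illustration of the check described above, a minimal sketch could look like this. The header string, the function name `already_analyzed`, and the assumption that each comment is a dict with its body under `text` are all placeholders, not taken from the script itself.

```python
# Hypothetical header; the actual summary header used by the script may differ.
SUMMARY_HEADER = "LLM investigation summary"


def already_analyzed(comments, header=SUMMARY_HEADER):
    """Return True if any existing comment starts with the summary header."""
    for comment in comments:
        # Assumption: each comment is a dict carrying its body under "text".
        text = comment.get("text", "")
        if text.lstrip().startswith(header):
            return True
    return False


existing = [
    {"text": "label:known_issue"},
    {"text": "LLM investigation summary\nLikely infrastructure problem."},
]
if already_analyzed(existing):
    # Exit early: no LLM call, no new comment, no job URL on stdout.
    print("already analyzed, skipping")
```

Matching on the start of the comment rather than anywhere in the body avoids treating a human reply that merely quotes the header as a prior analysis.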

Benefits:
- Saves LLM API costs.
- Prevents cluttering openQA jobs with duplicate comments.
- Ensures hook idempotency.

fix: correctly handle existing comments and fix unit tests

Motivation:
Ensure that idempotency logic correctly parses comments and that unit
tests provide accurate coverage for different API endpoints.

Design Choices:
- Updated 'fetch_json' to return a list on error if the URL indicates
  a comments API call.
- Updated all test mocks to correctly distinguish between job details
  and comments API calls.
- Verified passing unit tests.
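
The error fallback described above can be sketched as follows. This is not the actual `fetch_json` from the script: the injected `get` callable, the `"/comments"` substring test, and the example URLs are assumptions made so that the sketch runs without network access.

```python
import sys


def fetch_json(url, get):
    """Return decoded JSON from `get(url)`, or an empty container on error.

    `get` is a hypothetical injected callable that returns decoded JSON or
    raises; the real script would perform an HTTP request here instead.
    """
    try:
        return get(url)
    except Exception as error:  # broad on purpose, this is only a sketch
        print(f"Failed to fetch {url}: {error}", file=sys.stderr)
        # Assumption: comment listings are JSON arrays, everything else is
        # a JSON object, so the fallback keeps the expected type.
        return [] if "/comments" in url else {}


def failing_get(url):
    raise RuntimeError("simulated network error")


print(fetch_json("https://openqa.example.com/api/v1/jobs/1/comments", failing_get))
print(fetch_json("https://openqa.example.com/api/v1/jobs/1", failing_get))
```

Returning a list for the comments call means callers can iterate over the result unconditionally, which is what the idempotency check relies on.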

Benefits:
- Robust idempotency implementation.
- Reliable test suite.

fix: broaden similar failures search in LLM investigation

Motivation:
The LLM investigation was too narrow as it only searched for failures
with the exact same test name. This led to incorrect 'only instance'
conclusions when the same issue affected multiple different tests in
the same build.

Design Choices:
- Updated 'build_search_query' to search for all failed jobs in the same
  build, regardless of the test name.
- Enhanced 'test_search_query' to provide more historical context for
  the specific test (including arch and flavor).
- Included distri, version, arch, and flavor in the prompt to give the
  LLM more environmental context.
- Updated unit tests to support the enriched job settings and new
  search logic.
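
To illustrate the difference between the two searches, here is a minimal sketch. The helper names `build_wide_params` and `history_params`, the query parameter names, and the example settings are assumptions for illustration; they are not the signatures of `build_search_query` or `test_search_query`.

```python
from urllib.parse import urlencode


def build_wide_params(settings):
    """Query for every failed job in the same build, whatever the test name."""
    return {
        "distri": settings["DISTRI"],
        "version": settings["VERSION"],
        "build": settings["BUILD"],
        "result": "failed",
    }


def history_params(settings, limit=10):
    """Query for recent runs of this specific test on the same arch and flavor."""
    return {
        "distri": settings["DISTRI"],
        "version": settings["VERSION"],
        "test": settings["TEST"],
        "arch": settings["ARCH"],
        "flavor": settings["FLAVOR"],
        "limit": limit,
    }


# Illustrative job settings, not taken from a real job.
settings = {
    "DISTRI": "opensuse",
    "VERSION": "Tumbleweed",
    "ARCH": "x86_64",
    "FLAVOR": "DVD",
    "BUILD": "20260411",
    "TEST": "textmode",
}
print(urlencode(build_wide_params(settings)))
print(urlencode(history_params(settings)))
```

The build-wide query deliberately omits `test`, so a failure that hits several scenarios in one build shows up as widespread instead of as an isolated case.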

Benefits:
- More accurate LLM assessments by providing broader failure context.
- Prevents false 'only instance' reports when an issue is widespread
  across a build.

fix: skip LLM investigation for softfailed jobs

Motivation:
Jobs with a result of 'softfailed' are generally considered acceptable
and do not require the same level of rigorous investigation as outright
failures.

Design Choices:
- Updated the result check in 'openqa-llm-investigate' to include both
  'passed' and 'softfailed'.
- Enhanced the logging message to report the specific result that
  triggered the skip.
- Added a new unit test case to verify correct handling of
  'softfailed' jobs.
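
A minimal sketch of such a result check, assuming the job details are a dict with a `result` key; the function name `skip_reason` and the message wording are hypothetical:

```python
# Results that do not warrant an LLM investigation.
SKIPPED_RESULTS = ("passed", "softfailed")


def skip_reason(job):
    """Return a log message if the job should be skipped, else None."""
    result = job.get("result")
    if result in SKIPPED_RESULTS:
        # Report the specific result that triggered the skip.
        return f"Job result is '{result}', skipping LLM investigation"
    return None


for job in ({"result": "softfailed"}, {"result": "failed"}):
    reason = skip_reason(job)
    print(reason if reason else "investigating")
```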

Benefits:
- Reduces unnecessary LLM API calls and avoids redundant analysis of
  non-critical issues.

feat: rename INVESTIGATE: YES/NO to BISECT: YES/NO

Motivation:
The LLM performs the 'investigation' (analysis) and decides whether an
automated 'bisection' should be triggered. Renaming the decision
string to 'BISECT: YES/NO' clarifies the resulting action and
distinguishes it from the analysis phase.

Design Choices:
- Updated the LLM prompt instruction to use 'BISECT: YES/NO'.
- Updated the output gatekeeper logic to check for 'BISECT: YES'.
- Updated unit test mocks and assertions to match the new strings.
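
The gatekeeper check could look roughly like the sketch below. Matching the marker at the start of a line, the function name `url_if_bisect`, and the example URL are assumptions; the script may check for the string differently.

```python
BISECT_MARKER = "BISECT: YES"


def url_if_bisect(llm_response, job_url):
    """Return the job URL only when the response asks for a bisection."""
    for line in llm_response.splitlines():
        # Assumption: the decision appears on a line of its own.
        if line.strip().startswith(BISECT_MARKER):
            return job_url
    return None


response = "Summary: new failure in this build only.\nBISECT: YES"
url = url_if_bisect(response, "https://openqa.example.com/tests/1")
if url:
    # The hook treats any URL on stdout as a request to continue.
    print(url)
```

Printing nothing for 'BISECT: NO' is what lets the calling hook stop early when its captured output is empty.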

Benefits:
- More descriptive terminology for users reading job comments.
- Clearer architectural distinction between analysis and execution.

feat: refine LLM bisection criteria in prompt

Motivation:
The LLM was too aggressive in recommending bisections for widespread
or long-standing issues. We need to explicitly guide it to only
recommend bisection for newly introduced regressions where it provides
the most value.

Design Choices:
- Updated the prompt to define bisection as a costly operation.
- Added explicit instructions to recommend 'BISECT: NO' for
  infrastructure issues, widespread build failures, or long-standing
  historical failures.

Benefits:
- More judicious use of worker resources by avoiding redundant
  bisections.
- Higher quality automated analysis.