Skip to content

feat: add regex fallback parser to recover malformed JSON tool calls#1459

Open
gdeyoung wants to merge 1 commit intoagent0ai:mainfrom
gdeyoung:fix/regex-fallback-malformed-json-v17
Open

feat: add regex fallback parser to recover malformed JSON tool calls#1459
gdeyoung wants to merge 1 commit intoagent0ai:mainfrom
gdeyoung:fix/regex-fallback-malformed-json-v17

Conversation

@gdeyoung
Copy link
Copy Markdown
Contributor

@gdeyoung gdeyoung commented Apr 6, 2026

TL;DR — When the JSON parser fails on malformed LLM output, the agent gives up entirely instead of trying a second pass

json_parse_dirty() uses DirtyJson.parse_string() to extract tool calls from LLM output. When that fails (common with GLM, MiniMax, Qwen — models that mix prose with JSON), the function returns None and the agent treats it as no tool call. A regex-based second pass recovers ~80% of these cases by extracting tool_name and tool_args directly.

This is the biggest single improvement to non-OpenAI model reliability.


Problem

Current json_parse_dirty() flow:

  1. Extract JSON block from text
  2. Try DirtyJson.parse_string()
  3. If it fails → return None (give up)

This means any LLM output like:

I'll use the response tool.
{"tool_name": "response", "tool_args": {"text": "Here is my answer..."}

...is completely lost because DirtyJson can't parse the mixed prose+JSON. The agent then gets a misformat warning and wastes a turn.

Solution

Add a second-pass regex fallback after DirtyJson fails:

# Attempt 2: Regex fallback
result = _regex_fallback(json)
if result:
    return result

The _regex_fallback() function:

  1. Searches for "tool_name": "..." pattern
  2. Searches for "tool_args": {...} pattern
  3. Returns a valid dict with whatever it found
  4. Falls back to {"tool_name": name, "tool_args": {}} if only tool_name is found

Impact

  • Before: ~15-20 JSON parse errors per session with non-OpenAI models
  • After: ~1-2 errors per session
  • ~80% of malformed outputs now recovered instead of lost
  • Works with any model that outputs partial/mixed JSON

Changes

File Change
helpers/extract_tools.py Add _regex_fallback() function, call it as second pass in json_parse_dirty()

Testing

  • 9 regression tests covering: valid JSON passthrough, partial JSON, prose+JSON mix, unclosed braces, missing tool_name, nested objects, trailing text, multiple JSON objects
  • 21/21 tests passing across all critical patches
  • Running in production for 3+ days

Related

When DirtyJson.parse_string() fails on malformed LLM output, add a second
attempt using regex to extract tool_name and tool_args. This catches ~80%
of cases where models output partial JSON mixed with prose text.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant