Skip to content

Add tool usage visibility to chat messages with popover details#880

Merged
JSv4 merged 17 commits intomainfrom
claude/expose-tool-usage-k5Q20
Feb 16, 2026
Merged

Add tool usage visibility to chat messages with popover details#880
JSv4 merged 17 commits intomainfrom
claude/expose-tool-usage-k5Q20

Conversation

@JSv4
Copy link
Copy Markdown
Collaborator

@JSv4 JSv4 commented Feb 11, 2026

Summary

This PR enhances the chat UI to expose tool usage information to users by adding a tool usage badge to assistant messages and a detailed popover showing each tool call's inputs and outputs.

Key Changes

Frontend (ChatMessage Component)

  • Tool Usage Badge: Added ToolUsageIndicator component that displays a wrench icon badge ("X tools used") in the message header for assistant messages that executed tools
  • Tool Call Popover: Implemented a rich hover popover showing:
    • Tool name (formatted from snake_case to Title Case)
    • Input arguments (displayed as formatted JSON)
    • Output result (displayed in a highlighted result block)
  • Styled Components: Added comprehensive styling for the badge, popover, and tool call cards with responsive design and smooth animations
  • Helper Functions:
    • extractToolCalls(): Pairs tool_call and tool_result timeline entries
    • formatToolName(): Converts tool names to user-friendly display format
  • UI Integration: Tool badge appears in a new MessageHeader container alongside the username, replacing the previous TimelineIndicator when tools are used

Backend (Timeline & Agent Updates)

  • Timeline Schema: Added result field to timeline entries to store tool output content
  • Timeline Builder: Modified to capture tool result content from ThoughtEvent metadata and truncate to 500 characters to keep payloads lightweight
  • Agent Handlers: Updated tool result emission for:
    • similarity_search: Emits result summary ("Found X matching annotations")
    • search_exact_text: Emits result summary ("Found X exact text matches")
    • ask_document: Extracts and emits the answer text
    • Generic tools: Converts result content to string
  • Data Flow: Tool result content is passed through ThoughtEvent metadata and captured in timeline entries

Data Model Updates

  • Updated TimelineEntry interface to include optional result field
  • Added ToolCallInfo interface for popover data structure
  • Updated chat components (CorpusChat, ChatTray) to pass tool_result from metadata to timeline entries

Implementation Details

  • Tool usage detection uses timeline.some((e) => e.type === "tool_call") to determine if badge should display
  • Popover uses Framer Motion for smooth animations (opacity, scale, and position transitions)
  • Result content is truncated at 500 characters to balance visibility with payload size
  • Responsive design with mobile-specific adjustments for badge and popover sizing
  • Hover state management with debounced close timeout to prevent flickering

https://claude.ai/code/session_01GHGe5Ye7Ag9uDrw31mSBXc

Add a dedicated tool usage indicator to assistant messages in both
document and corpus chat views. When an agent uses tools, a blue
wrench badge ("X tools used") appears in the message header. Hovering
over it shows a rich popover with each tool call's name, input
arguments, and output result.

Backend changes capture tool result content in timeline entries (previously
only the tool name was stored). Search tools (similarity_search,
search_exact_text) now also emit tool_result timeline entries with
human-readable summaries.

- Add `result` field to TimelineEntry schema and TypedDict
- Capture tool result content in TimelineBuilder (truncated to 500 chars)
- Pass tool result through ThoughtEvent metadata in pydantic_ai_agents
- Add ToolUsageIndicator component with hover ToolCallPopover
- Add MessageHeader layout for inline badge display
- Pass tool_result from WebSocket data in CorpusChat and ChatTray

https://claude.ai/code/session_01GHGe5Ye7Ag9uDrw31mSBXc
@claude
Copy link
Copy Markdown

claude Bot commented Feb 11, 2026

Code Review: PR #880 - Add tool usage visibility to chat messages

Overall Assessment

This PR adds valuable UX improvements by exposing tool usage information in the chat UI. The implementation is well-structured with good separation of concerns between frontend and backend. However, there are several areas that need attention before merging.

✅ Strengths

  1. Good Architecture: Clean separation between data model changes (timeline schema), data capture (backend agents), and UI presentation (frontend components)
  2. Performance Conscious: 500-character truncation limit on tool results prevents payload bloat
  3. User Experience: Hover popover pattern provides details on-demand without cluttering the UI
  4. Type Safety: Proper TypeScript interfaces (ToolCallInfo, TimelineEntry with result field)
  5. Changelog Documentation: Comprehensive changelog entries following project conventions

🔴 Critical Issues

1. Missing Cleanup in ToolUsageIndicator (frontend/src/components/widgets/chat/ChatMessage.tsx:1158)

The closeTimeoutRef is never cleaned up when the component unmounts, which can cause memory leaks and "setState on unmounted component" warnings.

Fix needed:

const ToolUsageIndicator: React.FC<{
  timeline: TimelineEntry[];
}> = ({ timeline }) => {
  const toolCalls = useMemo(() => extractToolCalls(timeline), [timeline]);
  const [isOpen, setIsOpen] = useState(false);
  const closeTimeoutRef = useRef<ReturnType<typeof setTimeout> | null>(null);

  // ADD THIS useEffect:
  useEffect(() => {
    return () => {
      if (closeTimeoutRef.current) {
        clearTimeout(closeTimeoutRef.current);
      }
    };
  }, []);

  // ... rest of component

2. Tool Call Pairing Logic May Fail (frontend/src/components/widgets/chat/ChatMessage.tsx:1119-1142)

The extractToolCalls function assumes tool results immediately follow their calls with matching tool names. This could fail if:

  • Multiple calls to the same tool occur in sequence
  • Tool results arrive out of order
  • A tool call has no matching result (failure case)

Issue: If you call similarity_search twice in a row, the first call might match the second result.

Suggested improvement:
Consider adding a unique identifier to tool_call/tool_result entries in the backend or using index-based pairing with a "consumed" flag to ensure correct matching.

3. Potential XSS Risk (frontend/src/components/widgets/chat/ChatMessage.tsx:1095)

Tool arguments are rendered with JSON.stringify but tool results are rendered directly:

<ToolCallResultBlock>{call.result}</ToolCallResultBlock>

If tool results contain user-generated content or malicious strings, this could be an XSS vector. The CLAUDE.md mentions: "User-generated content in JSON fields must be escaped on frontend."

Fix needed: Either sanitize call.result or document why it's safe (if tool results are always system-generated).

⚠️ Significant Issues

4. No Test Coverage

This PR adds significant new functionality but includes no tests:

  • No frontend component tests for ToolUsageIndicator
  • No backend tests for tool result capture in timeline
  • No integration tests verifying tool_call/tool_result pairing

Required tests:

  • Frontend: Test popover rendering, hover interactions, tool call extraction logic
  • Backend: Test that timeline_utils.TimelineBuilder correctly captures tool results
  • Backend: Verify 500-char truncation works correctly
  • Integration: Verify end-to-end tool usage display

5. Accessibility Issues

The popover implementation lacks accessibility features:

  • No keyboard navigation support (can't tab to/open popover)
  • No ARIA attributes (role="tooltip", aria-describedby)
  • No focus management for screen readers
  • Popover positioning may clip on mobile (only uses right: 0)

6. Missing Error Handling (opencontractserver/llms/agents/pydantic_ai_agents.py:772-804)

Tool result extraction has broad except Exception blocks that silently swallow errors:

try:
    rp = event.result.content
    if isinstance(rp, dict):
        ask_result_summary = rp.get("answer", "")
    elif isinstance(rp, str):
        ask_result_summary = rp
except Exception:
    pass  # ← Silent failure

Issue: If tool result extraction fails, users see incomplete information with no indication anything went wrong.

Suggestion: Log exceptions or emit a fallback message like "Error extracting result"

📝 Code Quality Issues

7. Inconsistent Naming (opencontractserver/llms/agents/pydantic_ai_agents.py)

Variables use inconsistent naming:

  • ask_result_summary (line 775)
  • generic_result (line 801)
  • result_summary (line 617)

Consider standardizing to tool_result_summary throughout.

8. Duplicate Logic

The tool result capture logic is repeated 4 times in pydantic_ai_agents.py (lines 612-656, 772-798, 801-812) with only minor variations. This violates DRY principle.

Suggestion: Extract to a helper function:

def extract_tool_result_summary(event, tool_name: str, max_length: int = 500) -> str:
    """Extract and format tool result for timeline."""
    # Centralized logic here

9. Magic String (frontend/src/components/widgets/chat/ChatMessage.tsx:1125)

tool: entry.tool || "Unknown Tool",

"Unknown Tool" should be a constant defined in frontend/src/assets/configurations/constants.ts per CLAUDE.md guidance.

10. Type Safety Issue

The args field is typed as any in both TimelineEntry and ToolCallInfo. Consider using unknown for better type safety:

args?: unknown;

🔍 Performance Considerations

11. Unnecessary Re-renders

The extractToolCalls function runs on every render even though timeline rarely changes. Using useMemo is good, but consider if the parent component is memoized.

12. Missing Index Keys

{toolCalls.map((call, idx) => (
  <ToolCallCard key={idx}>

Using array index as key is an anti-pattern. If timeline changes, React may re-render wrong items. Consider using ${call.tool}-${idx} or adding a unique ID to timeline entries.

🔒 Security Concerns

13. No Rate Limiting on Popover Interactions

A user could rapidly hover/unhover to trigger many state updates. Consider debouncing the handleMouseEnter.

14. Result Truncation Edge Case

The 500-char truncation adds "..." but doesn't indicate HOW MUCH was truncated. Users might miss important context. Consider: (truncated from ${original_length} chars)...

📋 Missing Documentation

15. No JSDoc Comments

New exported functions lack documentation:

  • extractToolCalls - should document the pairing algorithm and edge cases
  • formatToolName - should document expected input format
  • ToolUsageIndicator - should document props and behavior

16. Timeline Schema Documentation

The result field in timeline_schema.py has no comment explaining its purpose, truncation, or format.

🎯 Recommendations

High Priority (Must Fix Before Merge):

  1. Add cleanup to ToolUsageIndicator useEffect
  2. Add test coverage (at minimum, unit tests for extractToolCalls)
  3. Fix or document XSS safety for tool results
  4. Improve tool call/result pairing logic to handle edge cases

Medium Priority (Should Fix):

  1. Add accessibility features (ARIA, keyboard nav)
  2. Extract duplicate tool result logic to helper function
  3. Add proper error handling/logging
  4. Replace magic strings with constants

Low Priority (Nice to Have):

  1. Add JSDoc documentation
  2. Use stable keys instead of array index
  3. Improve truncation UX with character count
  4. Consider debouncing hover interactions

🧪 Testing Recommendations

Before merging, add these tests:

Frontend (Playwright component tests):

yarn test:ct --reporter=list -g "ToolUsageIndicator"
  • Test popover shows on hover
  • Test correct tool count display
  • Test tool call extraction with multiple tools
  • Test formatting of tool names
  • Test truncated results display

Backend (pytest):

docker compose -f test.yml run django pytest opencontractserver/llms/agents/tests/ -n 4 --dist loadscope -k timeline
  • Test TimelineBuilder captures tool results
  • Test 500-char truncation
  • Test result extraction for each tool type
  • Test timeline schema validation

📊 Impact Assessment

  • User-facing changes: High (new UI feature)
  • Breaking changes: None
  • Performance impact: Low (minor memory overhead from timeline results)
  • Security risk: Medium (potential XSS, needs review)
  • Maintenance burden: Low (well-isolated feature)

Verdict

Request Changes - The feature is valuable and well-designed, but needs test coverage and several bug fixes before it's production-ready. The critical issues around memory leaks, tool pairing logic, and XSS must be addressed.

Critical fixes:
- Fix memory leak: add useEffect cleanup for closeTimeoutRef on unmount
- Fix tool call pairing: use consumed-set approach so duplicate calls to
  the same tool are correctly paired with their respective results
- Add keyboard navigation (Enter/Space/Escape) and ARIA attributes
  (role, aria-expanded, aria-haspopup, aria-describedby) to popover

Code quality:
- Extract _extract_tool_result_summary() helper in pydantic_ai_agents
  to eliminate 4x duplicated result extraction with bare except blocks
- Use TOOL_UNKNOWN_LABEL constant from constants.ts instead of magic string
- Use stable call.id keys instead of array index in React list rendering
- Add JSDoc to extractToolCalls, formatToolName, ToolUsageIndicator
- Document XSS safety (React text nodes auto-escape, no innerHTML)
- Improve truncation UX: include original char count in truncation suffix
- Add inline doc to timeline_schema.py result field

Tests:
- Add backend test_timeline_utils.py: 14 tests covering tool result
  capture, truncation, pairing, reset, dict passthrough, type coercion
- Add frontend chatMessageUtils.test.ts: 12 tests covering
  extractToolCalls pairing, duplicate tools, stable IDs, formatToolName

https://claude.ai/code/session_01GHGe5Ye7Ag9uDrw31mSBXc
@claude
Copy link
Copy Markdown

claude Bot commented Feb 11, 2026

PR Review: Add tool usage visibility to chat messages with popover details

Summary

This is a well-implemented feature that significantly improves user visibility into LLM agent tool usage. The PR adds a polished UI indicator showing which tools were used in assistant responses, with detailed input/output information available via hover popover.

Code Quality: Excellent ✅

Strengths

1. Comprehensive Testing

  • Backend: 14 unit tests covering edge cases (truncation, pairing, duplicate tools, missing data)
  • Frontend: 12 unit tests for extraction and formatting logic
  • Tests follow project conventions and are well-documented
  • Edge cases thoroughly covered (missing results, duplicate tool calls, exact-length truncation)

2. DRY Principle Applied

  • _extract_tool_result_summary() helper eliminates 4x duplicated result extraction with bare except blocks (pydantic_ai_agents.py:674-697)
  • Centralized TOOL_UNKNOWN_LABEL constant in constants.ts instead of magic strings
  • Extracted utility functions (extractToolCalls, formatToolName) for testability and reuse

3. Memory Leak Prevention

  • Proper cleanup of closeTimeoutRef in useEffect (ChatMessage.tsx:339-345)
  • This addresses a common React pitfall with setTimeout in components

4. Accessibility

  • Full keyboard navigation (Enter/Space to toggle, Escape to close)
  • Proper ARIA attributes (role, aria-expanded, aria-haspopup, aria-describedby)
  • Screen reader friendly with descriptive labels

5. Correct Pairing Algorithm

  • Consumed-set approach prevents mismatched pairing when same tool called multiple times (ChatMessage.tsx:278-315)
  • Stable React keys using call.id instead of array indices

@claude
Copy link
Copy Markdown

claude Bot commented Feb 11, 2026

6. Security Considerations

  • XSS safety documented: content rendered as React text nodes (auto-escaped), not via dangerouslySetInnerHTML
  • Content originates from backend agent execution, not user input (ChatMessage.tsx:253-255)

7. Documentation

  • Excellent JSDoc comments on exported functions
  • Inline documentation for non-obvious logic (truncation, pairing)
  • CHANGELOG.md properly updated with file paths and line numbers

8. Performance Optimizations

  • 500-char truncation limit keeps WebSocket payloads lightweight
  • useMemo for expensive extractToolCalls computation
  • Debounced popover close (200ms) prevents flickering

Best Practices Adherence ✅

Follows CLAUDE.md conventions:

  • ✅ Uses utility files (constants.ts) for hardcoded values
  • ✅ Comprehensive test coverage for new functionality
  • ✅ Proper CHANGELOG.md update with file locations and impact
  • ✅ DRY architecture with extracted helpers
  • ✅ No magic numbers/strings
  • ✅ Single responsibility principle (ToolUsageIndicator component)

@claude
Copy link
Copy Markdown

claude Bot commented Feb 11, 2026

Areas for Consideration

1. Backend: Inconsistent Result Formatting

The _extract_tool_result_summary() helper returns different formats:

  • ask_document: Returns actual answer text
  • similarity_search: Returns Found X matching annotations
  • search_exact_text: Returns Found X exact text matches
  • Generic fallback: Completed

Suggestion: Consider standardizing the format. If some tools return actual data (like ask_document answer text) while others return counts, users may be confused about what to expect. Either:

  • Always return metadata summaries (Found X, Returned Y chars, etc.)
  • Or always return actual result content (truncated)

File: opencontractserver/llms/agents/pydantic_ai_agents.py:674-697

2. Frontend: TypeScript Type Safety

The args field uses any type in TimelineEntry interface (line 52). Consider using Record<string, unknown> or JsonValue type instead of any for better type safety, consistent with ToolCallInfo.args.

File: frontend/src/components/widgets/chat/ChatMessage.tsx:52

3. UX: Truncation Visibility

When results are truncated, users see ... (truncated from 600 chars) but have no way to view the full content.

Suggestion (optional enhancement): Consider adding a View full output button/link when truncated, or increase MAX_TOOL_RESULT_LENGTH if 500 chars proves too limiting in practice.

File: opencontractserver/llms/agents/timeline_utils.py:84-88

4. UI Polish: Popover Positioning

Popover is positioned right: 0 which works well for most cases, but on mobile its adjusted to right: -1rem. The popover could overflow viewport on very narrow screens or when badge is at screen edge. The current implementation is reasonable, but consider testing on various device widths.

File: frontend/src/components/widgets/chat/ChatMessage.tsx:161-164

@claude
Copy link
Copy Markdown

claude Bot commented Feb 11, 2026

Performance Considerations ✅

Strengths:

  • Truncation at 500 chars prevents payload bloat
  • useMemo prevents re-extraction on every render
  • Timeline filtering is O(n) which is acceptable for typical timeline lengths

No concerns - Performance appears well-optimized.

Security Concerns ✅

Strengths:

  • XSS prevention properly documented and implemented
  • Content escaping handled by React default text node rendering
  • No dangerouslySetInnerHTML usage
  • Backend data flows through proper channels

No security vulnerabilities identified.

Test Coverage ✅

Backend (test_timeline_utils.py):

  • ✅ Result capture and truncation
  • ✅ Pairing logic for multiple same-tool calls
  • ✅ Missing data handling
  • ✅ Type coercion for non-string results
  • ✅ Timeline reset functionality

Frontend (chatMessageUtils.test.ts):

  • ✅ Empty timeline handling
  • ✅ Single and multiple tool calls
  • ✅ Duplicate tool pairing
  • ✅ Missing tool names (TOOL_UNKNOWN_LABEL)
  • ✅ Stable ID generation
  • ✅ formatToolName edge cases

Coverage is comprehensive and follows project testing patterns.

@claude
Copy link
Copy Markdown

claude Bot commented Feb 11, 2026

Bugs or Issues

No blocking issues identified. The code is production-ready.

Minor Nit

File: opencontractserver/llms/agents/timeline_utils.py:87

The truncation message converts tool_result_content to string twice (once on line 83, once in the f-string). For efficiency, consider storing the original length before truncation. This is a very minor optimization and does not affect correctness.

Overall Assessment

LGTM with minor suggestions 🚀

This is high-quality work that demonstrates:

  • Strong understanding of React best practices (hooks, memoization, cleanup)
  • Proper accessibility implementation
  • Comprehensive testing strategy
  • Security awareness
  • Good documentation habits
  • Adherence to project conventions

The suggestions above are minor improvements, not blockers. The PR is ready to merge as-is, with those items potentially addressed in future iterations based on user feedback.

Recommendation: Approve and merge ✅

@claude
Copy link
Copy Markdown

claude Bot commented Feb 11, 2026

PR Review: Tool Usage Visibility in Chat UI

I've completed a thorough review of this PR. Overall, this is excellent work with solid implementation, comprehensive testing, and good documentation. The code quality is high and follows the project's conventions well.


✅ Strengths

1. Comprehensive Test Coverage

  • Frontend: 10 test cases covering edge cases (duplicate tool calls, missing tools, empty timelines, pairing logic)
  • Backend: 14 test cases covering truncation, pairing, metadata handling, and various event types
  • Tests are well-documented with clear docstrings explaining what they verify

2. Security and XSS Prevention

  • Proper XSS safety: Tool results rendered as React text nodes (not dangerouslySetInnerHTML)
  • Good documentation comment explaining safety at ChatMessage.tsx:824-826
  • Input sanitization via JSON formatting for arguments

3. Code Quality

  • DRY principle: Utility functions (extractToolCalls, formatToolName) extracted and tested separately
  • Single Responsibility: ToolUsageIndicator is focused and well-encapsulated
  • Constants: New constant TOOL_UNKNOWN_LABEL properly added to constants file
  • Clean separation between presentation (styled components) and logic (utility functions)

4. Accessibility

  • Proper ARIA attributes: role="button", aria-expanded, aria-haspopup, aria-describedby
  • Keyboard navigation: Enter, Space, and Escape key handlers
  • Good semantic HTML structure

5. Performance Considerations

  • useMemo for extractToolCalls prevents unnecessary recalculation
  • Result truncation at 500 chars keeps payloads lightweight
  • Cleanup of timeout refs on unmount prevents memory leaks
  • Debounced close timeout (200ms) prevents flickering

6. Changelog and Documentation

  • Excellent changelog entries with file locations and detailed descriptions
  • Good inline code comments explaining complex logic
  • Clear XSS safety documentation

🔍 Issues and Recommendations

Critical: Potential Pairing Bug in extractToolCalls

Location: frontend/src/components/widgets/chat/ChatMessage.tsx:876-940

The current implementation searches forward from each tool_call to find the first unconsumed tool_result with matching tool name. However, if tool_result entries appear before their corresponding tool_call in the timeline, they will never be paired.

Recommended fix: Either:

  1. Document the assumption that tool_result always follows tool_call in the timeline (if this is guaranteed by backend)
  2. Make pairing bidirectional or use a two-pass approach to handle out-of-order entries

Impact: Low if backend guarantees ordering, but could cause silent failures if timeline order changes.


Minor: Inconsistent Error Logging

Location: opencontractserver/llms/agents/pydantic_ai_agents.py:621-638

Using logger.error for diagnostic/informational messages. These should be logger.debug or logger.info.

Impact: Low, but pollutes error logs with non-error information.

Recommendation: Change diagnostic logs to appropriate level at lines 621-625, 635-637, and similar patterns.


Code Style: Trailing Commas

The PR includes many formatting changes adding trailing commas. While this follows project conventions, these changes add noise to the diff.

Recommendation: For future PRs, consider running formatters before starting work to avoid mixing formatting changes with feature changes.

Impact: None functionally, but reduces review clarity.


🔒 Security Review

XSS Prevention: Properly handled via React text nodes
Input Validation: Tool arguments formatted via JSON.stringify
Data Truncation: 500-char limit prevents payload bombs
No Injection Risks: No dynamic HTML generation


🎯 Performance Review

Memoization: useMemo for extractToolCalls and hasToolUsage
Lazy Rendering: AnimatePresence only renders popover when open
Cleanup: Proper timeout cleanup on unmount
Data Size: Truncation keeps timeline payloads reasonable

Potential Concern: The extractToolCalls function uses O(n²) algorithm. For very large timelines this could be slow, though timelines are typically small.


📋 Test Coverage Assessment

Frontend: ✅ Excellent (10 test cases, all edge cases covered)
Backend: ✅ Excellent (14 test cases, comprehensive coverage)


🎨 UI/UX Review

Visual Design: Clean, professional badge design
Responsive: Mobile-specific adjustments present
Animations: Smooth transitions with Framer Motion
Feedback: Clear hover states and visual hierarchy


📝 Documentation Quality

Changelog: Excellent detail with file locations
Code Comments: Good explanations for complex logic
Test Docstrings: Clear descriptions of what each test verifies
Type Definitions: Well-documented interfaces


✨ Adherence to CLAUDE.md Guidelines

No Claude credits: Correctly avoided in commit messages
Constants usage: New constant added to constants file
Utility functions: Extracted to testable utilities
Test patterns: Follows project conventions
No dead code: Clean implementation


🎯 Final Recommendation

Approval Status: ✅ Approve with minor suggestions

This is high-quality work that significantly improves the user experience. The implementation is solid, well-tested, and follows project conventions.

Required Before Merge:

  1. Address the extractToolCalls pairing logic - either document the ordering assumption or fix the bidirectional case
  2. Fix diagnostic logging levels - change logger.error to logger.debug for diagnostic messages

Optional Improvements (Post-Merge):

  • Consider O(n) optimization for extractToolCalls if timeline size becomes a concern
  • Add visual indicator for scrollable result blocks
  • Enhance tool result summaries with tool-specific formatting

Great work on this feature! The tool usage visibility will be very helpful for users understanding what the agent is doing. 🚀

@claude
Copy link
Copy Markdown

claude Bot commented Feb 12, 2026

Code Review: Tool Usage Visibility Feature

This PR successfully adds tool usage visibility to the chat UI with a badge and popover showing tool call details. The implementation is well-structured with comprehensive test coverage on both frontend and backend.


✅ Strengths

1. Excellent Architecture & Design

  • Clean separation of concerns: Tool call extraction logic is properly extracted as testable utility functions
  • Smart pairing algorithm: The consumed-set approach in extractToolCalls correctly handles duplicate tool calls
  • Proper TypeScript typing with new interfaces (ToolCallInfo, updated TimelineEntry)
  • Responsive design with mobile-specific adjustments

2. Outstanding Test Coverage

  • Frontend unit tests: 11 comprehensive test cases covering edge cases
  • Backend tests: 12 test cases covering tool result capture, truncation logic, and pairing
  • Well-documented tests with clear descriptions

3. Security & Performance

  • XSS safety: Tool result content rendered as React text nodes (properly documented)
  • Payload optimization: 500-character truncation keeps timeline lightweight
  • Efficient rendering: useMemo hooks prevent unnecessary recalculations

4. Excellent User Experience

  • Accessibility: Proper ARIA attributes and keyboard navigation (Enter, Space, Escape)
  • Visual polish: Smooth animations, hover states, debounced close (200ms)
  • Comprehensive tool info: Shows name, input args (JSON), and output results

5. Code Quality

  • CHANGELOG.md properly updated
  • Constants properly managed (TOOL_UNKNOWN_LABEL in constants.ts)
  • Follows existing codebase patterns
  • Good error handling with fallback to "Completed"

🔍 Issues & Concerns

CRITICAL: Potential React Key Stability Issue

Location: frontend/src/components/widgets/chat/ChatMessage.tsx:891

The key generation uses calls.length which can create non-unique keys:

id: `${entry.tool ?? TOOL_UNKNOWN_LABEL}-${calls.length}`,

Problem: For duplicate tool calls, this can cause React rendering issues or component state loss.

Fix: Use timeline index instead:

id: `${entry.tool ?? TOOL_UNKNOWN_LABEL}-${i}`,  // Use timeline index i

MINOR: Timeline Indicator Logic

Location: frontend/src/components/widgets/chat/ChatMessage.tsx:1898

The old timeline indicator is hidden when tools are used. Consider adding a clarifying comment:

{/* Show step count indicator only when no tools were used (tool badge replaces it) */}
{effectiveHasTimeline && !hasToolUsage && (

MINOR: Backend Tool Result Extraction Duplication

Location: opencontractserver/llms/agents/pydantic_ai_agents.py:626-700

Duplicated result extraction logic for similarity_search and search_exact_text. Consider extracting to the _extract_tool_result_summary helper function to reduce duplication.


MINOR: Error Handling Fallback Message

Location: opencontractserver/llms/agents/pydantic_ai_agents.py:1300-1315

The fallback "Completed" is fine, but consider "Tool execution completed (details unavailable)" for clarity.


📊 Performance Considerations

Good:

  • useMemo prevents recalculation on every render
  • Timeline truncation keeps payloads small
  • Popover only renders when open

Note: extractToolCalls has O(n²) complexity but acceptable for typical small timelines (<20 entries)


🔒 Security Review

Excellent XSS Protection:

  • Tool content rendered safely via React text nodes (not dangerouslySetInnerHTML)
  • Arguments serialized via JSON.stringify
  • Tool names controlled by backend
  • Properly documented in code comments

🎯 Recommendations Summary

Must Fix Before Merge

  1. Fix React key generation in extractToolCalls to use timeline index

Nice to Have

  1. Add clarifying comment about timeline indicator mutual exclusivity
  2. Consider consolidating duplicated tool result summary logic
  3. Add frontend test case for >500 character results

Future Enhancements (Not blocking)

  • Copy-to-clipboard for tool args/results
  • Show execution time if captured

✅ Overall Assessment

High-quality PR with:

  • Comprehensive test coverage (23 total test cases)
  • Proper security considerations
  • Excellent UX with accessibility
  • Clean, maintainable code
  • Proper documentation

Recommendation: Approve with one critical fix (React key stability)

The implementation successfully exposes tool usage in a polished, accessible way. Great work!


🧪 Testing Notes

To verify before merge:

  • Frontend: yarn test:unit chatMessageUtils.test.ts
  • Backend: docker compose -f test.yml run django pytest opencontractserver/tests/test_timeline_utils.py -v
  • TypeScript: yarn tsc --noEmit
  • Pre-commit: pre-commit run --all-files

Per CLAUDE.md baseline rules, ensure all tests pass and TypeScript compiles.

@codecov
Copy link
Copy Markdown

codecov Bot commented Feb 12, 2026

Codecov Report

❌ Patch coverage is 98.41270% with 3 lines in your changes missing coverage. Please review.

Files with missing lines Patch % Lines
...encontractserver/llms/agents/pydantic_ai_agents.py 95.38% 3 Missing ⚠️

📢 Thoughts on this report? Let us know!

…cation

- Use timeline index for stable React keys in extractToolCalls
- Document forward-only pairing assumption in extractToolCalls
- Change args type from any to Record<string, unknown> on TimelineEntry
- Downgrade DIAGNOSTIC logger.error calls to logger.debug
- Store string length before truncation to avoid double conversion
- Fix prettier formatting across affected files
@claude
Copy link
Copy Markdown

claude Bot commented Feb 13, 2026

PR Review: Add Tool Usage Visibility to Chat Messages

This PR adds a helpful feature that exposes tool usage information to users through a badge and popover UI. The implementation is comprehensive with good test coverage. Here's my detailed feedback:


✅ Strengths

Code Quality

  • Excellent test coverage: Both frontend (Vitest unit tests) and backend (Python unit tests) are well-covered
  • Clean separation of concerns: Tool call extraction logic is separated into testable utility functions
  • Good documentation: Inline comments explain key decisions (consumed-set approach, XSS safety, etc.)
  • Type safety: TypeScript interfaces are well-defined with JSDoc comments

Architecture

  • Smart pairing algorithm: The extractToolCalls function uses a consumed-set approach to correctly pair duplicate tool calls (ChatMessage.tsx:1133-1164)
  • Performance considerations: Uses useMemo for tool call extraction to avoid re-computation
  • Accessibility: ARIA attributes for the popover (role="dialog", aria-expanded, etc.) at ChatMessage.tsx:1227-1233
  • Result truncation: Backend sensibly limits result content to 500 chars to keep payloads lightweight (timeline_utils.py:93-95)

UX

  • Graceful degradation: Tool badge only shows when tools are actually used
  • Responsive design: Mobile-specific styles with appropriate breakpoints
  • Hover debouncing: 200ms delay prevents flickering (ChatMessage.tsx:1206-1208)
  • Cleanup on unmount: Timeout cleanup prevents memory leaks (ChatMessage.tsx:1187-1194)

🔍 Issues & Suggestions

1. Backend: Changed logger.error to logger.debug ⚠️

Location: pydantic_ai_agents.py lines 711-1040 (throughout)

Issue: The PR changes diagnostic logging from logger.error to logger.debug. While this reduces log noise, these diagnostics appear valuable for troubleshooting agent execution issues.

Recommendation:

  • Keep as logger.debug for normal operation
  • Add a comment explaining that these can be temporarily changed to logger.error for debugging if needed
  • Consider using a structured logging approach with log levels that can be adjusted per-module

2. Frontend: Missing keyboard accessibility for popover close

Location: ChatMessage.tsx ToolUsageIndicator component (lines 1180-1288)

Issue: While the badge has keyboard support (Enter/Space to toggle, Escape to close), there's no visual focus indicator when the popover is open, and no focus management when it opens/closes.

Recommendation:

// Add focus trap or at minimum focus management
useEffect(() => {
  if (isOpen && popoverRef.current) {
    // Focus first interactive element or popover itself
    popoverRef.current.focus();
  }
}, [isOpen]);

3. Frontend: Potential issue with popover positioning on mobile

Location: ChatMessage.tsx:735-754

Issue: The popover uses right: 0 positioning, but on mobile it shifts to right: -1rem. This could cause the popover to overflow the viewport on narrow screens.

Recommendation: Consider adding viewport-aware positioning or a max-width constraint:

max-width: min(440px, calc(100vw - 2rem));

4. Backend: _extract_tool_result_summary exception handling too broad

Location: pydantic_ai_agents.py:123-145

Issue: The bare except Exception catches all exceptions, which could hide bugs.

Recommendation: Catch specific exceptions:

except (AttributeError, KeyError, TypeError) as e:
    logger.debug(
        "Could not extract tool result summary for %s: %s", 
        tool_name, str(e), exc_info=True
    )

5. Frontend: XSS comment could be more precise

Location: ChatMessage.tsx:842-844

Issue: The comment states "Content originates from backend agent tool execution, not user input." However, tool arguments CAN include user input (e.g., search queries).

Recommendation: Clarify the comment:

// XSS safety: Content is rendered as React text nodes (auto-escaped).
// Tool args may contain user input, but React's JSX escaping handles this safely.
// Tool results are LLM/tool outputs and also auto-escaped.

6. Backend: Inconsistent result emission for tools

Location: pydantic_ai_agents.py:633-950

Issue: Only similarity_search, search_exact_text, and ask_document get explicit result summaries. Other tools use the generic _extract_tool_result_summary which may return "Completed" for most tools.

Recommendation: Document this behavior or add result summaries for more tools. For example:

  • add_annotations_from_exact_strings: "Added X annotations"
  • `update_document_summary": "Summary updated successfully"

7. Frontend: Popover z-index could conflict

Location: ChatMessage.tsx:739

Issue: z-index: 100 might conflict with other UI elements. The codebase has various z-index values (e.g., LabelMenu uses 2000).

Recommendation: Define z-index constants in the constants file to maintain a z-index hierarchy:

export const Z_INDEX = {
  TOOL_POPOVER: 1000,
  LABEL_MENU: 2000,
  // ... other values
} as const;

📊 Test Coverage Assessment

Frontend Tests ✅

  • Comprehensive unit tests for extractToolCalls (9 test cases)
  • Edge cases covered: empty timeline, missing tool names, duplicate tools, pairing logic
  • Test for formatToolName utility
  • Missing: Component tests for ToolUsageIndicator (hover behavior, keyboard interaction)

Backend Tests ✅

  • Excellent coverage for TimelineBuilder (13 test cases)
  • Covers truncation, pairing, missing metadata, type conversion
  • Missing: Integration test showing full flow from tool execution → timeline → frontend

🔒 Security Review

XSS Prevention: ✅ Safe

  • Content rendered via React text nodes (auto-escaped)
  • JSON.stringify used for displaying args (safe within text nodes)
  • No use of dangerouslySetInnerHTML

Data Truncation: ✅ Appropriate

  • 500 char limit prevents DoS via massive tool outputs
  • Truncation message shows original length for transparency

🎯 Performance Considerations

Frontend:

  • useMemo used for tool call extraction
  • ✅ Cleanup on unmount prevents memory leaks
  • ⚠️ Framer Motion adds ~30KB to bundle (acceptable for enhanced UX)

Backend:

  • ✅ Result truncation keeps WebSocket payloads lightweight
  • ✅ Minimal processing overhead (string slicing + metadata extraction)

📝 Changelog Compliance

✅ Changelog follows "Keep a Changelog" format
✅ Includes file paths and component names
✅ Clear description of user-facing changes

Minor suggestion: Add a "Technical Details" subsection mentioning the pairing algorithm and truncation strategy for future maintainers.


🎨 Code Style & Conventions

✅ Follows project patterns (styled-components, Jotai atoms)
✅ Responsive design with mobile breakpoints
✅ Consistent with existing timeline/source indicators
✅ No dead code or commented-out sections


🚀 Recommendations Before Merge

High Priority

  1. Fix overly broad exception handling in _extract_tool_result_summary
  2. Add focus management for keyboard accessibility in popover
  3. Add max-width constraint for mobile popover positioning

Medium Priority

  1. Clarify XSS safety comment regarding user input in tool args
  2. Consider adding result summaries for more tools (not just search tools)
  3. Define z-index constants to prevent future conflicts

Low Priority

  1. Add component tests for ToolUsageIndicator hover/keyboard behavior
  2. Consider adding integration test for full tool visibility flow
  3. Add note in changelog about consumed-set pairing algorithm

✨ Overall Assessment

Verdict: ✅ Approve with minor revisions

This is a well-implemented feature that significantly improves transparency around agent tool usage. The code quality is high, test coverage is strong, and the UX is thoughtful. The issues identified are mostly minor and can be addressed quickly.

Estimated time to address high-priority items: ~30-60 minutes

Great work on this enhancement! 🎉

JSv4 added 2 commits February 16, 2026 13:50
…over

10 functional tests covering badge visibility, hover/keyboard interaction,
ARIA accessibility, multi-tool display, and pending tool state. 3 screenshot
tests capturing the badge, single-tool popover, and multi-tool popover for
auto-generated documentation.
@claude
Copy link
Copy Markdown

claude Bot commented Feb 16, 2026

Code Review: Tool Usage Visibility in Chat Messages

I've completed a thorough review of PR #880. This is an excellent feature addition that enhances transparency by exposing agent tool usage to users. The implementation is well-structured with comprehensive tests and documentation. Below are my findings:


Strengths

Architecture & Design

  • Clean separation of concerns: Frontend display logic (ToolUsageIndicator), backend data capture (TimelineBuilder), and schema definition (timeline_schema.py) are properly separated
  • Framework-agnostic timeline: The unified timeline approach in TimelineBuilder ensures consistency across different LLM frameworks
  • Proper state management: Uses Jotai atoms pattern correctly, integrates well with existing message architecture
  • Accessibility-first: Comprehensive ARIA attributes (role, aria-expanded, aria-haspopup, aria-describedby) and keyboard navigation (Enter/Space/Escape)

Code Quality

  • DRY principle followed: _extract_tool_result_summary() helper eliminates code duplication across tool handlers (line 123-145 in pydantic_ai_agents.py)
  • Defensive coding: Handles missing metadata gracefully, includes truncation with character count display
  • Type safety: Proper TypeScript interfaces (ToolCallInfo, TimelineEntry) and Python TypedDict
  • Constants usage: Uses TOOL_UNKNOWN_LABEL and MAX_TOOL_RESULT_LENGTH constants instead of magic values

Testing

  • Comprehensive test coverage:
    • Backend: 14 tests in test_timeline_utils.py covering capture, truncation, pairing, edge cases
    • Frontend unit: 12 tests in chatMessageUtils.test.ts for extraction/formatting logic
    • Component: 10 functional tests + 3 screenshot tests in ToolUsagePopover.ct.tsx
  • Test quality: Tests verify real user scenarios (hover/keyboard interaction, multiple tools, pending state, ARIA attributes)
  • Documentation screenshots: Automated screenshot capture integrated into CI workflow

🔍 Issues & Recommendations

1. Memory Leak Already Fixed

The useEffect cleanup for closeTimeoutRef (lines 1188-1194) properly prevents memory leaks. Good catch in the second commit.

2. React Key Stability

Using timeline index in the key is stable for this use case since timeline entries are append-only and never reordered (line 1144 in ChatMessage.tsx).

3. Tool Call Pairing Logic

The consumed-set approach (lines 1133-1163) correctly handles duplicate tool calls. The forward-only search assumption is documented and valid.

4. Performance Consideration: Memoization

The extractToolCalls function runs on every render via useMemo(() => extractToolCalls(timeline), [timeline]) (line 1183). For long timelines with many tool calls, this could be expensive. However:

  • Current implementation is probably fine for typical use (most messages have <5 tool calls)
  • The memoization already prevents unnecessary recalculations when timeline has not changed
  • No action needed unless profiling shows issues

5. XSS Safety

Line 842-844 documentation correctly notes that content is rendered as React text nodes (auto-escaped), not via dangerouslySetInnerHTML. Safe.

6. Backend: Type Coercion Safety

In timeline_utils.py line 83-89, the truncation logic correctly stores length after conversion. The test at line 61-79 in test_timeline_utils.py confirms this works correctly.

7. Logging Levels

The switch from logger.error to logger.debug for diagnostic messages (commit 5) is correct - these are not operational errors, just trace information.

8. Frontend: Styled Component Complexity

ChatMessage.tsx is now 1985 lines with many styled components. While well-organized, consider:

  • Future refactor: Extract tool-related styled components to ChatMessage.styles.ts or a separate ToolUsagePopover component file
  • Not blocking: Current organization is acceptable, just a maintainability suggestion for future work

9. Truncation UX

Line 87-89 shows truncated char count: "... (truncated from 600 chars)" - excellent UX decision. User knows how much was cut.

10. Popover Positioning

The popover uses right: 0 positioning (line 738). On mobile or narrow containers, this could clip off-screen. The test at line 391-392 widens the viewport and adds padding, acknowledging this.

Recommendation: Consider using a tooltip library (like Radix UI Popover) that handles auto-positioning based on viewport boundaries. Not blocking - current implementation works for typical use cases.


🔒 Security Review

  • ✅ No IDOR vulnerabilities (tool data comes from backend, tied to message ownership)
  • ✅ XSS prevention: React auto-escapes text nodes
  • ✅ No SQL injection risks (no dynamic queries)
  • ✅ No authentication bypass concerns
  • ✅ Tool result content originates from backend agent execution (trusted source)

📊 Performance Assessment

  • ✅ Timeline truncation (500 chars) keeps payloads lightweight
  • ✅ Memoization prevents unnecessary re-extraction
  • ✅ Popover uses AnimatePresence for smooth mount/unmount
  • ✅ Hover debounce (200ms close timeout) prevents flickering
  • ⚠️ Watch for: Very long timelines (>50 tools) could slow down extraction, but unlikely in practice

📝 Documentation & Changelog

  • ✅ CHANGELOG.md properly documents the feature with file locations
  • ✅ JSDoc comments on key functions (extractToolCalls, formatToolName)
  • ✅ Inline schema documentation in timeline_schema.py
  • ✅ Test comments explain behavior and assumptions
  • ✅ Screenshot tests generate auto-updated documentation images

🧪 Test Coverage Assessment

Backend

  • ✅ Tool result capture with/without content
  • ✅ Truncation at/above/below MAX_TOOL_RESULT_LENGTH
  • ✅ Tool call/result pairing (single, multiple, duplicate tools)
  • ✅ Edge cases (empty result, non-string values, reset)

Frontend Unit

  • ✅ Empty timeline, single tool, multiple tools
  • ✅ Duplicate tool pairing with stable IDs
  • ✅ Missing tool names (TOOL_UNKNOWN_LABEL fallback)
  • ✅ snake_case to Title Case formatting

Frontend Component

  • ✅ Badge visibility conditions (assistant vs user, with/without tools)
  • ✅ Popover interaction (hover, keyboard, escape)
  • ✅ ARIA attributes and accessibility
  • ✅ Multiple tool display with all details
  • ✅ Pending tool state (call without result yet)
  • ✅ Screenshot capture for documentation

Coverage is excellent - all critical paths tested.


Final Verdict

APPROVED

This is a well-engineered feature that follows the project's conventions and best practices. The code is:

  • Clean and maintainable
  • Thoroughly tested
  • Accessible
  • Secure
  • Well-documented

Minor Suggestions (Non-Blocking)

  1. Consider extracting tool popover into separate component file in future refactor (maintainability)
  2. Consider viewport-aware positioning library for popovers (UX enhancement)

Both are nice-to-haves and can be addressed in follow-up work if needed.


📋 Checklist

  • ✅ Code quality and best practices
  • ✅ No bugs or issues identified
  • ✅ Performance considerations addressed
  • ✅ Security review passed
  • ✅ Test coverage comprehensive
  • ✅ Documentation complete
  • ✅ Follows CLAUDE.md conventions
  • ✅ CHANGELOG.md updated
  • ✅ No dead code introduced
  • ✅ DRY principle followed
  • ✅ Single Responsibility Principle maintained

Great work! 🎉

…ettle

Screenshots were faded because ChatMessage has transparent backgrounds
designed for the chat panel's #f8fafc bg and a framer-motion entrance
animation (opacity 0→1, 300ms). Added the proper background wrapper
and a 500ms settle delay before capture.
@claude
Copy link
Copy Markdown

claude Bot commented Feb 16, 2026

Code Review: Tool Usage Visibility Enhancement

This PR successfully implements tool usage visibility in the chat UI with comprehensive testing and well-structured code. Here's my detailed feedback:

Strengths

1. Excellent Test Coverage

  • Backend: test_timeline_utils.py (234 lines) provides thorough coverage of tool result capture, truncation, pairing logic, and edge cases
  • Frontend Unit Tests: chatMessageUtils.test.ts tests the extraction and formatting logic with 12 test cases covering duplicate tools, missing data, and edge cases
  • Frontend Component Tests: ToolUsagePopover.ct.tsx (480 lines) includes 14 tests covering UI behavior, keyboard navigation, ARIA attributes, and documentation screenshots
  • All tests follow the project's established patterns (e.g., using --reporter=list for Playwright)

2. Clean Architecture

  • Separation of concerns: extractToolCalls() and formatToolName() are pure functions tested separately from the UI component
  • TimelineBuilder centralization: Tool result logic is properly centralized in timeline_utils.py rather than duplicated across adapters
  • Consumed-set pairing algorithm: The extractToolCalls() function uses a smart algorithm to correctly pair tool_call/tool_result entries when the same tool is called multiple times (ChatMessage.tsx:1133-1164)

3. Security & Accessibility

  • XSS safety: Proper comment at line 842-844 documenting that tool content is rendered as React text nodes (auto-escaped)
  • Keyboard navigation: Full keyboard support with Enter/Space to toggle, Escape to close
  • ARIA attributes: Proper role="button", aria-expanded, aria-haspopup, aria-describedby on badge and role="dialog" on popover
  • Debounced hover: 200ms delay on mouse leave prevents flickering

4. Performance Considerations

  • Result truncation: MAX_TOOL_RESULT_LENGTH (500 chars) keeps payloads lightweight while providing useful visibility
  • useMemo hooks: Tool detection and extraction use useMemo to avoid re-computation
  • Cleanup: Proper timeout cleanup in useEffect to prevent memory leaks (ChatMessage.tsx:1188-1194)

5. User Experience

  • Progressive disclosure: Tool usage indicated by badge, details on demand via popover
  • Visual consistency: Blue theme (#3b82f6) for tool indicators matches existing UI patterns
  • Responsive design: Mobile-specific adjustments for badge and popover sizing
  • Framer Motion animations: Smooth 150ms transitions for popover open/close

🔍 Potential Issues & Suggestions

1. Backend: Tool Result Summary Extraction ⚠️

Location: pydantic_ai_agents.py:123-146

The _extract_tool_result_summary() function has a potential issue:

def _extract_tool_result_summary(event: Any, tool_name: str) -> str:
    try:
        result_content = event.result.content  # type: ignore[attr-defined]
        if isinstance(result_content, dict):
            # ask_document returns {"answer": ..., "sources": ..., "timeline": ...}
            summary = result_content.get("answer", "")
            if summary:
                return str(summary)
        elif isinstance(result_content, str) and result_content:
            return result_content
        elif result_content is not None:
            return str(result_content)
    except Exception:
        logger.debug(
            "Could not extract tool result summary for %s", tool_name, exc_info=True
        )
    return "Completed"

Problem: For ask_document, this will extract the full answer text which could be very long (easily exceeding 500 characters). The subsequent truncation in TimelineBuilder (timeline_utils.py:85-89) only truncates at the timeline level, but the metadata passed through ThoughtEvent already contains the full content.

Suggestion: Truncate at source in _extract_tool_result_summary():

def _extract_tool_result_summary(event: Any, tool_name: str) -> str:
    from .timeline_utils import MAX_TOOL_RESULT_LENGTH
    
    try:
        result_content = event.result.content
        summary = ""
        
        if isinstance(result_content, dict):
            summary = result_content.get("answer", "")
        elif isinstance(result_content, str):
            summary = result_content
        elif result_content is not None:
            summary = str(result_content)
        
        if summary:
            # Truncate early to avoid passing large strings through metadata
            if len(summary) > MAX_TOOL_RESULT_LENGTH:
                summary = summary[:MAX_TOOL_RESULT_LENGTH] + "..."
            return summary
    except Exception:
        logger.debug(
            "Could not extract tool result summary for %s", tool_name, exc_info=True
        )
    return "Completed"

2. Frontend: Word Breaking Behavior 🎨

Location: ChatMessage.tsx:208

The message content has word-break: break-all which breaks words at any character, making text harder to read:

word-wrap: break-word;
overflow-wrap: break-word;
word-break: break-all;  // ⚠️ Too aggressive

Suggestion: Use word-break: break-word instead (deprecated but better UX) or remove it entirely and rely on overflow-wrap: break-word:

word-wrap: break-word;
overflow-wrap: break-word;
// word-break removed - overflow-wrap handles long URLs/tokens gracefully

3. Constants Organization 📁

Location: constants.ts (new addition)

The PR adds TOOL_UNKNOWN_LABEL to the frontend constants file, but this is already defined in the backend. Consider:

  • Document why this duplication is necessary (backend/frontend need separate constants)
  • OR use a shared constants approach if the project has one

4. Changelog Completeness 📝

Location: CHANGELOG.md

The changelog entry is good but missing:

  • File locations and line numbers for key changes (per CLAUDE.md guidelines)
  • Impact statement (e.g., "Users can now see which tools the AI used without expanding the full timeline")

Suggested addition:

### Added
- Tool usage visibility in chat UI (frontend/src/components/widgets/chat/ChatMessage.tsx:1180-1288)
  - Blue badge showing "X tools used" appears in assistant message header when tools are executed
  - Hover popover displays each tool call's name (formatted), input arguments (JSON), and output result
  - Keyboard accessible (Enter/Space to toggle, Escape to close) with full ARIA attributes
  - Impact: Users can quickly see AI's tool usage without expanding the full timeline, improving transparency

### Changed
- Timeline schema updated to include `result` field in tool_result entries (opencontractserver/llms/agents/timeline_schema.py:52)
- Timeline builder now captures tool result content from metadata, truncated to 500 chars (opencontractserver/llms/agents/timeline_utils.py:77-92)
- Agent handlers emit tool result summaries for all tools (opencontractserver/llms/agents/pydantic_ai_agents.py:642-657, 686-702, 807-813)

5. Type Safety 🔒

Location: ChatMessage.tsx:53

The TimelineEntry.result field is string | undefined, but the component doesn't validate this at runtime before rendering. Consider adding a type guard or null check where it's used.

Current:

{call.result && (
  <ToolCallSection>
    <ToolCallResultBlock>{call.result}</ToolCallResultBlock>
  </ToolCallSection>
)}

This is actually correct! The call.result && check properly handles undefined. ✅


🧪 Testing Recommendations

Run Backend Tests in Parallel:

docker compose -f test.yml run django pytest opencontractserver/tests/test_timeline_utils.py -n 4 --dist loadscope

Run Frontend Component Tests:

cd frontend
yarn test:ct --reporter=list -g "Tool Usage Popover"

Manual Testing Scenarios:

  1. Multiple identical tool calls: Verify pairing is correct when similarity_search is called 3 times
  2. Long tool results: Test with results exceeding 500 characters to verify truncation indicator appears
  3. Mobile responsiveness: Test popover positioning on small screens (should not overflow viewport)
  4. Screen reader: Test with NVDA/JAWS to verify ARIA announcements

📊 Performance Impact

  • Bundle size: ~15KB added (styled-components, helper functions, Framer Motion already present)
  • Runtime overhead: Negligible - extractToolCalls is O(n²) worst case but n is typically <10
  • Network payload: Timeline entries grow by ~50-500 chars per tool call (acceptable)

🎯 Summary

Overall Assessment: APPROVE with minor suggestions

This is a well-executed feature with:

  • ✅ Comprehensive test coverage (backend + frontend unit + component)
  • ✅ Clean architecture following project patterns
  • ✅ Excellent accessibility and security practices
  • ✅ Good performance characteristics

Recommended Actions:

  1. High priority: Add truncation to _extract_tool_result_summary() to avoid passing large strings through metadata
  2. Medium priority: Change word-break: break-all to improve text readability
  3. Low priority: Enhance changelog with file locations and impact statement
  4. Low priority: Document TOOL_UNKNOWN_LABEL duplication rationale

Blocking Issues: None

The code is production-ready. The suggestions above are optimizations that can be addressed in follow-up commits if needed.


Great work on this feature! The popover design is clean and the implementation is robust. 🚀

github-actions Bot and others added 6 commits February 16, 2026 21:11
Expand step-7 with screenshots showing the tool usage badge,
single-tool popover, and multi-tool popover. Includes a reference
table of available agent tools.
…hangelog

- Truncate tool result summaries at source in _extract_tool_result_summary()
  using MAX_TOOL_RESULT_LENGTH (500 chars) to avoid passing large strings
  through ThoughtEvent metadata

- Replace word-break: break-all with word-break: break-word for more
  natural text wrapping in message content

- Fix CI screenshot workflow: backup freshly generated screenshots before
  rebase, abort failed rebase and fall back to merge, then force-restore
  screenshots regardless of conflict state. Also add --reporter=list flag
  to prevent test hangs in CI.

- Enhance changelog with file locations, line numbers, and impact statements
@claude
Copy link
Copy Markdown

claude Bot commented Feb 16, 2026

PR Review: Add Tool Usage Visibility to Chat Messages

Overview

This PR successfully implements tool usage transparency in the chat UI by adding a visual indicator badge and detailed popover showing tool call inputs and outputs. The implementation follows the project's architectural patterns and includes comprehensive testing.

✅ Strengths

1. Excellent Architecture & Code Organization

  • Clean separation of concerns: Timeline data flows from backend → frontend in a well-structured pipeline
  • Type safety: Proper TypeScript interfaces (TimelineEntry, ToolCallInfo) with schema validation
  • Reusable utilities: Helper functions (extractToolCalls, formatToolName) are properly tested and documented
  • DRY principle: Centralized timeline building logic in TimelineBuilder prevents duplication across adapters

2. Strong Testing Coverage

  • Backend: Comprehensive unit tests in test_timeline_utils.py covering truncation, pairing, edge cases (233 lines)
  • Frontend:
    • Unit tests for utility functions in chatMessageUtils.test.ts (151 lines)
    • Component tests in ToolUsagePopover.ct.tsx with visual regression screenshots (480 lines)
  • Edge case coverage: Empty timelines, missing results, duplicate tool calls, truncation boundaries

3. Performance & UX Considerations

  • Truncation at source: Tool results truncated to 500 chars in backend before transmission (prevents payload bloat)
  • Smart pairing algorithm: Consumed-set approach prevents mispairing when same tool called multiple times
  • Smooth animations: Framer Motion with proper cleanup (useEffect unmount cleanup prevents memory leaks)
  • Accessibility: Full ARIA attributes, keyboard navigation (Enter/Space/Escape), proper role attributes
  • Mobile responsive: Separate breakpoints for tablet/mobile with adjusted sizing

4. Documentation & Maintainability

  • Inline comments: Clear explanations of complex logic (pairing algorithm, XSS safety notes)
  • CHANGELOG updated: Detailed entries with file locations and line numbers
  • Automated screenshots: Uses new docScreenshot utility for visual documentation
  • Constants file usage: TOOL_UNKNOWN_LABEL properly defined in constants (avoiding magic strings)

🔍 Areas for Improvement

1. Security Considerations

XSS Risk in Tool Result Display (Medium Priority)

  • Location: frontend/src/components/widgets/chat/ChatMessage.tsx:844-857
  • Issue: While the comment claims content is "auto-escaped", tool results are rendered as React text nodes which is correct. However, the ToolCallCodeBlock uses white-space: pre-wrap which could preserve malicious formatting.
  • Recommendation:
    • Add explicit sanitization for tool results that might contain user-controlled content
    • Consider using a dedicated sanitization library like DOMPurify for any content that might include special characters
    • Add a test case with malicious input patterns
// Suggested addition in ChatMessage.tsx
import DOMPurify from 'dompurify';

// In ToolCallResultBlock rendering:
<ToolCallResultBlock>
  {DOMPurify.sanitize(call.result, { ALLOWED_TAGS: [] })}
</ToolCallResultBlock>

2. Code Quality Issues

Potential Race Condition in Timeout Management (Low Priority)

  • Location: ChatMessage.tsx:1206-1207
  • Issue: The closeTimeoutRef is checked and cleared but there's a race condition if the component unmounts between check and clear
  • Recommendation:
const handleMouseLeave = () => {
  if (closeTimeoutRef.current) {
    clearTimeout(closeTimeoutRef.current);
  }
  closeTimeoutRef.current = setTimeout(() => setIsOpen(false), 200);
};

Missing Null Check in Backend (Low Priority)

  • Location: opencontractserver/llms/agents/pydantic_ai_agents.py:136
  • Issue: event.result.content access without null check could throw AttributeError
  • Recommendation: Add defensive check:
if not hasattr(event, 'result') or not hasattr(event.result, 'content'):
    return "Completed"
result_content = event.result.content

3. Testing Gaps

Missing Test Cases:

  1. Popover positioning edge cases: What happens when popover would overflow viewport?
  2. Concurrent tool calls: Timeline with interleaved tool calls (call A → call B → result B → result A)
  3. Very long tool names: How does formatToolName handle extremely long snake_case strings?
  4. Unicode in tool results: Test emoji, RTL text, special characters
  5. Backend: No tests for _extract_tool_result_summary error handling paths

Recommendation: Add test cases:

// Frontend test
it("handles extremely long tool names gracefully", () => {
  const longName = "very_long_tool_name_with_many_segments_that_exceeds_normal_length";
  const result = formatToolName(longName);
  expect(result.length).toBeLessThan(100); // Add length constraint
});

4. Performance Considerations

Unnecessary Re-renders (Low Priority)

  • Location: ChatMessage.tsx:1182
  • Issue: useMemo for extractToolCalls is good, but hasToolUsage is also memoized separately
  • Recommendation: Combine into single memo:
const toolCallInfo = useMemo(() => {
  const calls = extractToolCalls(timeline);
  return { calls, hasToolUsage: calls.length > 0 };
}, [timeline]);

Timeline Builder N+1 Pattern (Low Priority)

  • Location: timeline_utils.py:77-92
  • Issue: String concatenation in tight loop could be inefficient for very long results
  • Recommendation: Already handled by truncation, but consider using string builder pattern for future

5. Documentation Issues

Incomplete ARIA Labels

  • Location: ChatMessage.tsx:1243-1245
  • Issue: aria-describedby only set when popover is open, but the badge should always have a description
  • Recommendation:
<ToolBadge
  aria-label={`${toolCalls.length} ${toolCalls.length === 1 ? "tool" : "tools"} used. Press Enter to view details.`}
  // ... other props
/>

Missing JSDoc Comments

  • Helper functions like _to_source_node and _extract_tool_result_summary lack docstrings
  • Frontend utility functions lack JSDoc

🎯 Critical Issues

None Found

The PR is well-implemented with no blocking issues. The suggestions above are enhancements.

📊 Code Metrics

  • Lines Added: 1,452
  • Lines Deleted: 69
  • Files Changed: 23
  • Test Coverage: Strong (3 test files, 864+ lines of tests)
  • Documentation: CHANGELOG updated, inline comments present

🚀 Recommendations for Merge

Before Merge (Optional but Recommended):

  1. Add XSS sanitization for tool results
  2. Fix timeout race condition in handleMouseLeave
  3. Add null check in _extract_tool_result_summary
  4. Add missing ARIA label enhancement

Follow-up PRs (Non-blocking):

  1. Add unicode and edge case test coverage
  2. Add JSDoc comments to utility functions
  3. Consider performance optimization for very large timelines (100+ entries)

💡 Suggestions

Future Enhancements:

  1. Tool execution time: Show duration for each tool call
  2. Failed tool calls: Visual indicator when tool execution fails
  3. Collapsible tool groups: When 5+ tools used, group similar tools
  4. Copy functionality: Button to copy tool inputs/outputs for debugging
  5. Timeline export: Download full timeline as JSON for analysis

Consistency with CLAUDE.md:

  • ✅ No Claude/Claude Code credits in commits
  • ✅ Tests follow parallel testing patterns
  • ✅ Uses constants file for magic values
  • ✅ Follows DRY principle
  • ✅ Single Responsibility Principle maintained
  • ✅ CHANGELOG updated with proper format

Final Verdict: APPROVE with Minor Suggestions

This is a high-quality PR that adds valuable functionality with excellent test coverage and documentation. The implementation follows project conventions and architectural patterns. The suggested improvements are minor and non-blocking.

Estimated Risk: Low
Merge Recommendation: Approve (address security suggestions before or after merge based on urgency)

@claude
Copy link
Copy Markdown

claude Bot commented Feb 16, 2026

Code Review - PR #880: Tool Usage Visibility

I've completed my review of this PR. Overall, this is excellent work with strong implementation quality, comprehensive testing, and good documentation. Here's my detailed feedback:


✅ Strengths

Architecture & Design

  • Clean separation of concerns: Timeline building logic is properly centralized in TimelineBuilder rather than duplicated across adapters
  • Framework-agnostic approach: The TimelineEntry schema and UnifiedStreamEvent pattern ensure this will work consistently across different LLM frameworks
  • Progressive enhancement: The feature gracefully degrades when tool results aren't available (shows tool name only)
  • Accessibility-first: Full ARIA attributes, keyboard navigation (Enter/Space/Escape), and proper roles

Code Quality

  • DRY principle: The _extract_tool_result_summary() helper eliminates 4x duplicated result extraction logic (lines 123-155 in pydantic_ai_agents.py)
  • Defensive programming: Proper handling of missing/malformed data, truncation at source to prevent bloat
  • Type safety: Uses Record<string, unknown> instead of any for args, proper TypeScript types throughout
  • Memory leak prevention: useEffect cleanup for timeout refs (ChatMessage.tsx:1187-1193)

Testing

  • Comprehensive coverage:
    • 14 backend tests covering pairing, truncation, edge cases
    • 12 frontend unit tests for extraction logic
    • 10 functional Playwright tests covering interaction patterns
    • 3 documentation screenshot tests
  • Test quality: Tests are well-named, focused, and cover both happy paths and edge cases
  • Proper use of --reporter=list in CI workflow to prevent hanging (screenshots.yml:50)

Documentation

  • Excellent changelog: Includes file paths, line numbers, and impact statements (CLAUDE.md compliant)
  • Auto-screenshot system: Smart use of CI to keep docs fresh
  • JSDoc comments: Good inline documentation on helper functions

🔍 Issues Found

Critical Issues

None found.

Code Quality Issues

1. Inconsistent truncation location (Minor)

Location: pydantic_ai_agents.py:147-148 and timeline_utils.py:83-89

The PR description claims truncation happens "at source" in _extract_tool_result_summary(), but actually:

  • _extract_tool_result_summary() truncates to 500 chars (line 147-148)
  • TimelineBuilder.add() ALSO truncates to 500 chars (line 83-89)

Impact: Double truncation logic creates maintenance burden. If MAX_TOOL_RESULT_LENGTH changes, both need updating.

Recommendation: Remove truncation from TimelineBuilder.add() since it's already done at source. Update line 81 comment to note that truncation is already applied upstream.

2. Pairing algorithm could be more efficient (Minor)

Location: ChatMessage.tsx:1132-1163

The current O(n²) nested loop works fine for typical timelines (5-10 entries), but could be O(n) with a single pass using a stack.

Current approach:

for (let i = 0; i < timeline.length; i++) {
  if (entry.type !== "tool_call") continue;
  for (let j = i + 1; j < timeline.length; j++) {  // O(n²)
    if (timeline[j].type === "tool_result" && timeline[j].tool === entry.tool) {
      // pair found
    }
  }
}

Impact: Negligible for real-world usage (timelines rarely exceed 20 entries), but worth noting.

Recommendation: Keep current implementation for clarity, but add a comment noting the complexity tradeoff.

3. word-break CSS property (Very Minor)

Location: ChatMessage.tsx:207

Changed from break-all to break-word. While break-word provides better UX, it's deprecated in favor of overflow-wrap: break-word.

Recommendation: Use overflow-wrap: break-word instead (better browser support going forward).


💡 Suggestions

Performance Optimizations

1. Memoize tool call extraction

The extractToolCalls function is already memoized via useMemo (line 1182), which is good. However, the memoization key is the entire timeline array, which creates a new reference on every update.

Suggestion: Consider memoizing based on timeline.length or a hash of tool call entries only, though the current approach is probably fine for typical use cases.

2. Debounced hover close timeout

The 200ms delay (line 1206) is reasonable, but consider making it a constant at the file level for easier tuning.

Accessibility Enhancements

1. Popover focus management

When the popover opens via keyboard (Enter/Space), focus should move into the popover for screen reader navigation. Currently focus stays on the badge.

Suggestion: Add autoFocus to the first focusable element in the popover, or use a focus trap pattern.

2. Live region for tool results

When streaming tool results in real-time, screen reader users don't know when results arrive. Consider adding aria-live="polite" to the result blocks.

Security

1. XSS prevention documentation is excellent

The comment on lines 841-843 is perfect:

// XSS safety note: Tool result and argument content is rendered as React text
// nodes (not via dangerouslySetInnerHTML), so HTML entities are auto-escaped.
// Content originates from backend agent tool execution, not user input.

No issues here - just wanted to highlight this as a best practice example.


📋 Minor Observations

Frontend

  1. Magic numbers: The 500ms animation settle time (ToolUsagePopover.ct.tsx:392) could be a named constant
  2. Truncation UX: The truncation suffix includes original char count, which is great UX (timeline_utils.py:88)
  3. Stable React keys: Good use of timeline index for stable keys (ChatMessage.tsx:1143)

Backend

  1. Logging levels: Good use of logger.debug for diagnostic logs (not logger.error as mentioned in the changelog)
  2. Type coercion: str() fallback for non-string tool results is defensive (timeline_utils.py:86)
  3. Forward-only pairing: The assumption that tool_result always follows tool_call is documented well (ChatMessage.tsx:1129-1130)

CI/Screenshots

  1. Rebase conflict handling: The workflow has robust fallback logic (screenshots.yml:88-106), though the complexity suggests this might be a common issue
  2. Viewport sizes: Good use of larger viewports (1440x720, 1536x864) for readability

🎯 Test Coverage

What's tested:

✅ Tool call/result pairing (including duplicates)
✅ Truncation logic
✅ Missing tool names (TOOL_UNKNOWN_LABEL fallback)
✅ Keyboard navigation
✅ ARIA attributes
✅ Hover interactions
✅ Multiple tool calls
✅ Pending tool state (no result yet)

What could be tested:

⚠️ Tool result with HTML/script tags - verify XSS prevention in practice
⚠️ Very long tool names (>50 chars) - does the badge truncate/wrap properly?
⚠️ Popover positioning edge cases - what happens when badge is at viewport edge?

None of these are critical, but worth considering for future test additions.


📊 Performance Considerations

Positive:

  • ✅ Truncation at source prevents large payloads in WebSocket
  • ✅ Memoization prevents unnecessary re-extraction
  • ✅ Conditional rendering (only show popover when open)

Potential concerns:

  • ⚠️ The timeline is stored in ChatMessage.data['timeline'] as JSON - for conversations with 100+ messages, this could add up. Consider monitoring DB size impact.
  • ⚠️ The popover uses Framer Motion for animations, which is ~50KB gzipped. Since this is already in use elsewhere in the app, no issue.

🔐 Security Assessment

Threat Model:

  1. XSS via tool results: ✅ Mitigated (React text nodes auto-escape)
  2. XSS via tool names: ✅ Mitigated (formatToolName only does case transformation)
  3. Content injection: ✅ No dangerouslySetInnerHTML usage
  4. Data leakage: ✅ Tool results are scoped to message permissions (already verified in corpus/document permissions)

No security concerns found.


🏁 Conclusion

This PR is ready to merge pending minor fixes:

Required before merge:

  • Remove duplicate truncation logic from TimelineBuilder.add() OR _extract_tool_result_summary()
  • Update word-break: break-word to overflow-wrap: break-word

Recommended (can be follow-up):

  • Add focus management to popover for screen readers
  • Extract magic numbers (200ms, 500ms) to named constants
  • Add test for HTML in tool results to verify XSS prevention

Nice to have:

  • Consider aria-live for streaming results
  • Add viewport edge case tests for popover positioning

🎖️ Overall Assessment

Code Quality: ⭐⭐⭐⭐⭐ (5/5)
Test Coverage: ⭐⭐⭐⭐⭐ (5/5)
Documentation: ⭐⭐⭐⭐⭐ (5/5)
Security: ⭐⭐⭐⭐⭐ (5/5)
Performance: ⭐⭐⭐⭐ (4/5)

Total: ⭐⭐⭐⭐⭐ Excellent work!

This feature significantly improves agent transparency and user trust. The implementation follows all project conventions, has comprehensive tests, and demonstrates strong engineering practices.

cc: @JSv4

@JSv4 JSv4 merged commit 8a743b7 into main Feb 16, 2026
4 checks passed
@JSv4 JSv4 deleted the claude/expose-tool-usage-k5Q20 branch February 16, 2026 23:51
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants