Add tool usage visibility to chat messages with popover details by JSv4 · Pull Request #880 · Open-Source-Legal/OpenContracts

JSv4 · 2026-02-11T04:51:49Z

Summary

This PR enhances the chat UI to expose tool usage information to users by adding a tool usage badge to assistant messages and a detailed popover showing each tool call's inputs and outputs.

Key Changes

Frontend (ChatMessage Component)

Tool Usage Badge: Added ToolUsageIndicator component that displays a wrench icon badge ("X tools used") in the message header for assistant messages that executed tools
Tool Call Popover: Implemented a rich hover popover showing:
- Tool name (formatted from snake_case to Title Case)
- Input arguments (displayed as formatted JSON)
- Output result (displayed in a highlighted result block)
Styled Components: Added comprehensive styling for the badge, popover, and tool call cards with responsive design and smooth animations
Helper Functions:
- extractToolCalls(): Pairs tool_call and tool_result timeline entries
- formatToolName(): Converts tool names to user-friendly display format
UI Integration: Tool badge appears in a new MessageHeader container alongside the username, replacing the previous TimelineIndicator when tools are used

Backend (Timeline & Agent Updates)

Timeline Schema: Added result field to timeline entries to store tool output content
Timeline Builder: Modified to capture tool result content from ThoughtEvent metadata and truncate to 500 characters to keep payloads lightweight
Agent Handlers: Updated tool result emission for:
- similarity_search: Emits result summary ("Found X matching annotations")
- search_exact_text: Emits result summary ("Found X exact text matches")
- ask_document: Extracts and emits the answer text
- Generic tools: Converts result content to string
Data Flow: Tool result content is passed through ThoughtEvent metadata and captured in timeline entries

Data Model Updates

Updated TimelineEntry interface to include optional result field
Added ToolCallInfo interface for popover data structure
Updated chat components (CorpusChat, ChatTray) to pass tool_result from metadata to timeline entries

Implementation Details

Tool usage detection uses timeline.some((e) => e.type === "tool_call") to determine if badge should display
Popover uses Framer Motion for smooth animations (opacity, scale, and position transitions)
Result content is truncated at 500 characters to balance visibility with payload size
Responsive design with mobile-specific adjustments for badge and popover sizing
Hover state management with debounced close timeout to prevent flickering

https://claude.ai/code/session_01GHGe5Ye7Ag9uDrw31mSBXc

Add a dedicated tool usage indicator to assistant messages in both document and corpus chat views. When an agent uses tools, a blue wrench badge ("X tools used") appears in the message header. Hovering over it shows a rich popover with each tool call's name, input arguments, and output result. Backend changes capture tool result content in timeline entries (previously only the tool name was stored). Search tools (similarity_search, search_exact_text) now also emit tool_result timeline entries with human-readable summaries. - Add `result` field to TimelineEntry schema and TypedDict - Capture tool result content in TimelineBuilder (truncated to 500 chars) - Pass tool result through ThoughtEvent metadata in pydantic_ai_agents - Add ToolUsageIndicator component with hover ToolCallPopover - Add MessageHeader layout for inline badge display - Pass tool_result from WebSocket data in CorpusChat and ChatTray https://claude.ai/code/session_01GHGe5Ye7Ag9uDrw31mSBXc

claude · 2026-02-11T04:53:55Z

Code Review: PR #880 - Add tool usage visibility to chat messages

Overall Assessment

This PR adds valuable UX improvements by exposing tool usage information in the chat UI. The implementation is well-structured with good separation of concerns between frontend and backend. However, there are several areas that need attention before merging.

✅ Strengths

Good Architecture: Clean separation between data model changes (timeline schema), data capture (backend agents), and UI presentation (frontend components)
Performance Conscious: 500-character truncation limit on tool results prevents payload bloat
User Experience: Hover popover pattern provides details on-demand without cluttering the UI
Type Safety: Proper TypeScript interfaces (ToolCallInfo, TimelineEntry with result field)
Changelog Documentation: Comprehensive changelog entries following project conventions

🔴 Critical Issues

1. Missing Cleanup in ToolUsageIndicator (frontend/src/components/widgets/chat/ChatMessage.tsx:1158)

The closeTimeoutRef is never cleaned up when the component unmounts, which can cause memory leaks and "setState on unmounted component" warnings.

Fix needed:

const ToolUsageIndicator: React.FC<{
  timeline: TimelineEntry[];
}> = ({ timeline }) => {
  const toolCalls = useMemo(() => extractToolCalls(timeline), [timeline]);
  const [isOpen, setIsOpen] = useState(false);
  const closeTimeoutRef = useRef<ReturnType<typeof setTimeout> | null>(null);

  // ADD THIS useEffect:
  useEffect(() => {
    return () => {
      if (closeTimeoutRef.current) {
        clearTimeout(closeTimeoutRef.current);
      }
    };
  }, []);

  // ... rest of component

2. Tool Call Pairing Logic May Fail (frontend/src/components/widgets/chat/ChatMessage.tsx:1119-1142)

The extractToolCalls function assumes tool results immediately follow their calls with matching tool names. This could fail if:

Multiple calls to the same tool occur in sequence
Tool results arrive out of order
A tool call has no matching result (failure case)

Issue: If you call similarity_search twice in a row, the first call might match the second result.

Suggested improvement:
Consider adding a unique identifier to tool_call/tool_result entries in the backend or using index-based pairing with a "consumed" flag to ensure correct matching.

3. Potential XSS Risk (frontend/src/components/widgets/chat/ChatMessage.tsx:1095)

Tool arguments are rendered with JSON.stringify but tool results are rendered directly:

<ToolCallResultBlock>{call.result}</ToolCallResultBlock>

If tool results contain user-generated content or malicious strings, this could be an XSS vector. The CLAUDE.md mentions: "User-generated content in JSON fields must be escaped on frontend."

Fix needed: Either sanitize call.result or document why it's safe (if tool results are always system-generated).

⚠️ Significant Issues

4. No Test Coverage

This PR adds significant new functionality but includes no tests:

No frontend component tests for ToolUsageIndicator
No backend tests for tool result capture in timeline
No integration tests verifying tool_call/tool_result pairing

Required tests:

Frontend: Test popover rendering, hover interactions, tool call extraction logic
Backend: Test that timeline_utils.TimelineBuilder correctly captures tool results
Backend: Verify 500-char truncation works correctly
Integration: Verify end-to-end tool usage display

5. Accessibility Issues

The popover implementation lacks accessibility features:

No keyboard navigation support (can't tab to/open popover)
No ARIA attributes (role="tooltip", aria-describedby)
No focus management for screen readers
Popover positioning may clip on mobile (only uses right: 0)

6. Missing Error Handling (opencontractserver/llms/agents/pydantic_ai_agents.py:772-804)

Tool result extraction has broad except Exception blocks that silently swallow errors:

try:
    rp = event.result.content
    if isinstance(rp, dict):
        ask_result_summary = rp.get("answer", "")
    elif isinstance(rp, str):
        ask_result_summary = rp
except Exception:
    pass  # ← Silent failure

Issue: If tool result extraction fails, users see incomplete information with no indication anything went wrong.

Suggestion: Log exceptions or emit a fallback message like "Error extracting result"

📝 Code Quality Issues

7. Inconsistent Naming (opencontractserver/llms/agents/pydantic_ai_agents.py)

Variables use inconsistent naming:

ask_result_summary (line 775)
generic_result (line 801)
result_summary (line 617)

Consider standardizing to tool_result_summary throughout.

8. Duplicate Logic

The tool result capture logic is repeated 4 times in pydantic_ai_agents.py (lines 612-656, 772-798, 801-812) with only minor variations. This violates DRY principle.

Suggestion: Extract to a helper function:

def extract_tool_result_summary(event, tool_name: str, max_length: int = 500) -> str:
    """Extract and format tool result for timeline."""
    # Centralized logic here

9. Magic String (frontend/src/components/widgets/chat/ChatMessage.tsx:1125)

tool: entry.tool || "Unknown Tool",

"Unknown Tool" should be a constant defined in frontend/src/assets/configurations/constants.ts per CLAUDE.md guidance.

10. Type Safety Issue

The args field is typed as any in both TimelineEntry and ToolCallInfo. Consider using unknown for better type safety:

args?: unknown;

🔍 Performance Considerations

11. Unnecessary Re-renders

The extractToolCalls function runs on every render even though timeline rarely changes. Using useMemo is good, but consider if the parent component is memoized.

12. Missing Index Keys

{toolCalls.map((call, idx) => (
  <ToolCallCard key={idx}>

Using array index as key is an anti-pattern. If timeline changes, React may re-render wrong items. Consider using ${call.tool}-${idx} or adding a unique ID to timeline entries.

🔒 Security Concerns

13. No Rate Limiting on Popover Interactions

A user could rapidly hover/unhover to trigger many state updates. Consider debouncing the handleMouseEnter.

14. Result Truncation Edge Case

The 500-char truncation adds "..." but doesn't indicate HOW MUCH was truncated. Users might miss important context. Consider: (truncated from ${original_length} chars)...

📋 Missing Documentation

15. No JSDoc Comments

New exported functions lack documentation:

extractToolCalls - should document the pairing algorithm and edge cases
formatToolName - should document expected input format
ToolUsageIndicator - should document props and behavior

16. Timeline Schema Documentation

The result field in timeline_schema.py has no comment explaining its purpose, truncation, or format.

🎯 Recommendations

High Priority (Must Fix Before Merge):

Add cleanup to ToolUsageIndicator useEffect
Add test coverage (at minimum, unit tests for extractToolCalls)
Fix or document XSS safety for tool results
Improve tool call/result pairing logic to handle edge cases

Medium Priority (Should Fix):

Add accessibility features (ARIA, keyboard nav)
Extract duplicate tool result logic to helper function
Add proper error handling/logging
Replace magic strings with constants

Low Priority (Nice to Have):

Add JSDoc documentation
Use stable keys instead of array index
Improve truncation UX with character count
Consider debouncing hover interactions

🧪 Testing Recommendations

Before merging, add these tests:

Frontend (Playwright component tests):

yarn test:ct --reporter=list -g "ToolUsageIndicator"

Test popover shows on hover
Test correct tool count display
Test tool call extraction with multiple tools
Test formatting of tool names
Test truncated results display

Backend (pytest):

docker compose -f test.yml run django pytest opencontractserver/llms/agents/tests/ -n 4 --dist loadscope -k timeline

Test TimelineBuilder captures tool results
Test 500-char truncation
Test result extraction for each tool type
Test timeline schema validation

📊 Impact Assessment

User-facing changes: High (new UI feature)
Breaking changes: None
Performance impact: Low (minor memory overhead from timeline results)
Security risk: Medium (potential XSS, needs review)
Maintenance burden: Low (well-isolated feature)

Verdict

Request Changes - The feature is valuable and well-designed, but needs test coverage and several bug fixes before it's production-ready. The critical issues around memory leaks, tool pairing logic, and XSS must be addressed.

Critical fixes: - Fix memory leak: add useEffect cleanup for closeTimeoutRef on unmount - Fix tool call pairing: use consumed-set approach so duplicate calls to the same tool are correctly paired with their respective results - Add keyboard navigation (Enter/Space/Escape) and ARIA attributes (role, aria-expanded, aria-haspopup, aria-describedby) to popover Code quality: - Extract _extract_tool_result_summary() helper in pydantic_ai_agents to eliminate 4x duplicated result extraction with bare except blocks - Use TOOL_UNKNOWN_LABEL constant from constants.ts instead of magic string - Use stable call.id keys instead of array index in React list rendering - Add JSDoc to extractToolCalls, formatToolName, ToolUsageIndicator - Document XSS safety (React text nodes auto-escape, no innerHTML) - Improve truncation UX: include original char count in truncation suffix - Add inline doc to timeline_schema.py result field Tests: - Add backend test_timeline_utils.py: 14 tests covering tool result capture, truncation, pairing, reset, dict passthrough, type coercion - Add frontend chatMessageUtils.test.ts: 12 tests covering extractToolCalls pairing, duplicate tools, stable IDs, formatToolName https://claude.ai/code/session_01GHGe5Ye7Ag9uDrw31mSBXc

claude · 2026-02-11T05:28:02Z

PR Review: Add tool usage visibility to chat messages with popover details

Summary

This is a well-implemented feature that significantly improves user visibility into LLM agent tool usage. The PR adds a polished UI indicator showing which tools were used in assistant responses, with detailed input/output information available via hover popover.

Code Quality: Excellent ✅

Strengths

1. Comprehensive Testing

Backend: 14 unit tests covering edge cases (truncation, pairing, duplicate tools, missing data)
Frontend: 12 unit tests for extraction and formatting logic
Tests follow project conventions and are well-documented
Edge cases thoroughly covered (missing results, duplicate tool calls, exact-length truncation)

2. DRY Principle Applied

_extract_tool_result_summary() helper eliminates 4x duplicated result extraction with bare except blocks (pydantic_ai_agents.py:674-697)
Centralized TOOL_UNKNOWN_LABEL constant in constants.ts instead of magic strings
Extracted utility functions (extractToolCalls, formatToolName) for testability and reuse

3. Memory Leak Prevention

Proper cleanup of closeTimeoutRef in useEffect (ChatMessage.tsx:339-345)
This addresses a common React pitfall with setTimeout in components

4. Accessibility

Full keyboard navigation (Enter/Space to toggle, Escape to close)
Proper ARIA attributes (role, aria-expanded, aria-haspopup, aria-describedby)
Screen reader friendly with descriptive labels

5. Correct Pairing Algorithm

Consumed-set approach prevents mismatched pairing when same tool called multiple times (ChatMessage.tsx:278-315)
Stable React keys using call.id instead of array indices

claude · 2026-02-11T05:28:13Z

6. Security Considerations

XSS safety documented: content rendered as React text nodes (auto-escaped), not via dangerouslySetInnerHTML
Content originates from backend agent execution, not user input (ChatMessage.tsx:253-255)

7. Documentation

Excellent JSDoc comments on exported functions
Inline documentation for non-obvious logic (truncation, pairing)
CHANGELOG.md properly updated with file paths and line numbers

8. Performance Optimizations

500-char truncation limit keeps WebSocket payloads lightweight
useMemo for expensive extractToolCalls computation
Debounced popover close (200ms) prevents flickering

Best Practices Adherence ✅

Follows CLAUDE.md conventions:

✅ Uses utility files (constants.ts) for hardcoded values
✅ Comprehensive test coverage for new functionality
✅ Proper CHANGELOG.md update with file locations and impact
✅ DRY architecture with extracted helpers
✅ No magic numbers/strings
✅ Single responsibility principle (ToolUsageIndicator component)

claude · 2026-02-11T05:28:29Z

Areas for Consideration

1. Backend: Inconsistent Result Formatting

The _extract_tool_result_summary() helper returns different formats:

ask_document: Returns actual answer text
similarity_search: Returns Found X matching annotations
search_exact_text: Returns Found X exact text matches
Generic fallback: Completed

Suggestion: Consider standardizing the format. If some tools return actual data (like ask_document answer text) while others return counts, users may be confused about what to expect. Either:

Always return metadata summaries (Found X, Returned Y chars, etc.)
Or always return actual result content (truncated)

File: opencontractserver/llms/agents/pydantic_ai_agents.py:674-697

2. Frontend: TypeScript Type Safety

The args field uses any type in TimelineEntry interface (line 52). Consider using Record<string, unknown> or JsonValue type instead of any for better type safety, consistent with ToolCallInfo.args.

File: frontend/src/components/widgets/chat/ChatMessage.tsx:52

3. UX: Truncation Visibility

When results are truncated, users see ... (truncated from 600 chars) but have no way to view the full content.

Suggestion (optional enhancement): Consider adding a View full output button/link when truncated, or increase MAX_TOOL_RESULT_LENGTH if 500 chars proves too limiting in practice.

File: opencontractserver/llms/agents/timeline_utils.py:84-88

4. UI Polish: Popover Positioning

Popover is positioned right: 0 which works well for most cases, but on mobile its adjusted to right: -1rem. The popover could overflow viewport on very narrow screens or when badge is at screen edge. The current implementation is reasonable, but consider testing on various device widths.

File: frontend/src/components/widgets/chat/ChatMessage.tsx:161-164

claude · 2026-02-11T05:28:39Z

Performance Considerations ✅

Strengths:

Truncation at 500 chars prevents payload bloat
useMemo prevents re-extraction on every render
Timeline filtering is O(n) which is acceptable for typical timeline lengths

No concerns - Performance appears well-optimized.

Security Concerns ✅

Strengths:

XSS prevention properly documented and implemented
Content escaping handled by React default text node rendering
No dangerouslySetInnerHTML usage
Backend data flows through proper channels

No security vulnerabilities identified.

Test Coverage ✅

Backend (test_timeline_utils.py):

✅ Result capture and truncation
✅ Pairing logic for multiple same-tool calls
✅ Missing data handling
✅ Type coercion for non-string results
✅ Timeline reset functionality

Frontend (chatMessageUtils.test.ts):

✅ Empty timeline handling
✅ Single and multiple tool calls
✅ Duplicate tool pairing
✅ Missing tool names (TOOL_UNKNOWN_LABEL)
✅ Stable ID generation
✅ formatToolName edge cases

Coverage is comprehensive and follows project testing patterns.

claude · 2026-02-11T05:28:53Z

Bugs or Issues

No blocking issues identified. The code is production-ready.

Minor Nit

File: opencontractserver/llms/agents/timeline_utils.py:87

The truncation message converts tool_result_content to string twice (once on line 83, once in the f-string). For efficiency, consider storing the original length before truncation. This is a very minor optimization and does not affect correctness.

Overall Assessment

LGTM with minor suggestions 🚀

This is high-quality work that demonstrates:

Strong understanding of React best practices (hooks, memoization, cleanup)
Proper accessibility implementation
Comprehensive testing strategy
Security awareness
Good documentation habits
Adherence to project conventions

The suggestions above are minor improvements, not blockers. The PR is ready to merge as-is, with those items potentially addressed in future iterations based on user feedback.

Recommendation: Approve and merge ✅

https://claude.ai/code/session_01GHGe5Ye7Ag9uDrw31mSBXc

claude · 2026-02-11T05:42:46Z

PR Review: Tool Usage Visibility in Chat UI

I've completed a thorough review of this PR. Overall, this is excellent work with solid implementation, comprehensive testing, and good documentation. The code quality is high and follows the project's conventions well.

✅ Strengths

1. Comprehensive Test Coverage

Frontend: 10 test cases covering edge cases (duplicate tool calls, missing tools, empty timelines, pairing logic)
Backend: 14 test cases covering truncation, pairing, metadata handling, and various event types
Tests are well-documented with clear docstrings explaining what they verify

2. Security and XSS Prevention

Proper XSS safety: Tool results rendered as React text nodes (not dangerouslySetInnerHTML)
Good documentation comment explaining safety at ChatMessage.tsx:824-826
Input sanitization via JSON formatting for arguments

3. Code Quality

DRY principle: Utility functions (extractToolCalls, formatToolName) extracted and tested separately
Single Responsibility: ToolUsageIndicator is focused and well-encapsulated
Constants: New constant TOOL_UNKNOWN_LABEL properly added to constants file
Clean separation between presentation (styled components) and logic (utility functions)

4. Accessibility

Proper ARIA attributes: role="button", aria-expanded, aria-haspopup, aria-describedby
Keyboard navigation: Enter, Space, and Escape key handlers
Good semantic HTML structure

5. Performance Considerations

useMemo for extractToolCalls prevents unnecessary recalculation
Result truncation at 500 chars keeps payloads lightweight
Cleanup of timeout refs on unmount prevents memory leaks
Debounced close timeout (200ms) prevents flickering

6. Changelog and Documentation

Excellent changelog entries with file locations and detailed descriptions
Good inline code comments explaining complex logic
Clear XSS safety documentation

🔍 Issues and Recommendations

Critical: Potential Pairing Bug in extractToolCalls

Location: frontend/src/components/widgets/chat/ChatMessage.tsx:876-940

The current implementation searches forward from each tool_call to find the first unconsumed tool_result with matching tool name. However, if tool_result entries appear before their corresponding tool_call in the timeline, they will never be paired.

Recommended fix: Either:

Document the assumption that tool_result always follows tool_call in the timeline (if this is guaranteed by backend)
Make pairing bidirectional or use a two-pass approach to handle out-of-order entries

Impact: Low if backend guarantees ordering, but could cause silent failures if timeline order changes.

Minor: Inconsistent Error Logging

Location: opencontractserver/llms/agents/pydantic_ai_agents.py:621-638

Using logger.error for diagnostic/informational messages. These should be logger.debug or logger.info.

Impact: Low, but pollutes error logs with non-error information.

Recommendation: Change diagnostic logs to appropriate level at lines 621-625, 635-637, and similar patterns.

Code Style: Trailing Commas

The PR includes many formatting changes adding trailing commas. While this follows project conventions, these changes add noise to the diff.

Recommendation: For future PRs, consider running formatters before starting work to avoid mixing formatting changes with feature changes.

Impact: None functionally, but reduces review clarity.

🔒 Security Review

✅ XSS Prevention: Properly handled via React text nodes
✅ Input Validation: Tool arguments formatted via JSON.stringify
✅ Data Truncation: 500-char limit prevents payload bombs
✅ No Injection Risks: No dynamic HTML generation

🎯 Performance Review

✅ Memoization: useMemo for extractToolCalls and hasToolUsage
✅ Lazy Rendering: AnimatePresence only renders popover when open
✅ Cleanup: Proper timeout cleanup on unmount
✅ Data Size: Truncation keeps timeline payloads reasonable

Potential Concern: The extractToolCalls function uses O(n²) algorithm. For very large timelines this could be slow, though timelines are typically small.

📋 Test Coverage Assessment

Frontend: ✅ Excellent (10 test cases, all edge cases covered)
Backend: ✅ Excellent (14 test cases, comprehensive coverage)

🎨 UI/UX Review

✅ Visual Design: Clean, professional badge design
✅ Responsive: Mobile-specific adjustments present
✅ Animations: Smooth transitions with Framer Motion
✅ Feedback: Clear hover states and visual hierarchy

📝 Documentation Quality

✅ Changelog: Excellent detail with file locations
✅ Code Comments: Good explanations for complex logic
✅ Test Docstrings: Clear descriptions of what each test verifies
✅ Type Definitions: Well-documented interfaces

✨ Adherence to CLAUDE.md Guidelines

✅ No Claude credits: Correctly avoided in commit messages
✅ Constants usage: New constant added to constants file
✅ Utility functions: Extracted to testable utilities
✅ Test patterns: Follows project conventions
✅ No dead code: Clean implementation

🎯 Final Recommendation

Approval Status: ✅ Approve with minor suggestions

This is high-quality work that significantly improves the user experience. The implementation is solid, well-tested, and follows project conventions.

Required Before Merge:

Address the extractToolCalls pairing logic - either document the ordering assumption or fix the bidirectional case
Fix diagnostic logging levels - change logger.error to logger.debug for diagnostic messages

Optional Improvements (Post-Merge):

Consider O(n) optimization for extractToolCalls if timeline size becomes a concern
Add visual indicator for scrollable result blocks
Enhance tool result summaries with tool-specific formatting

Great work on this feature! The tool usage visibility will be very helpful for users understanding what the agent is doing. 🚀

…age-k5Q20 # Conflicts: # CHANGELOG.md

claude · 2026-02-12T19:27:49Z

Code Review: Tool Usage Visibility Feature

This PR successfully adds tool usage visibility to the chat UI with a badge and popover showing tool call details. The implementation is well-structured with comprehensive test coverage on both frontend and backend.

✅ Strengths

1. Excellent Architecture & Design

Clean separation of concerns: Tool call extraction logic is properly extracted as testable utility functions
Smart pairing algorithm: The consumed-set approach in extractToolCalls correctly handles duplicate tool calls
Proper TypeScript typing with new interfaces (ToolCallInfo, updated TimelineEntry)
Responsive design with mobile-specific adjustments

2. Outstanding Test Coverage

Frontend unit tests: 11 comprehensive test cases covering edge cases
Backend tests: 12 test cases covering tool result capture, truncation logic, and pairing
Well-documented tests with clear descriptions

3. Security & Performance

XSS safety: Tool result content rendered as React text nodes (properly documented)
Payload optimization: 500-character truncation keeps timeline lightweight
Efficient rendering: useMemo hooks prevent unnecessary recalculations

4. Excellent User Experience

Accessibility: Proper ARIA attributes and keyboard navigation (Enter, Space, Escape)
Visual polish: Smooth animations, hover states, debounced close (200ms)
Comprehensive tool info: Shows name, input args (JSON), and output results

5. Code Quality

CHANGELOG.md properly updated
Constants properly managed (TOOL_UNKNOWN_LABEL in constants.ts)
Follows existing codebase patterns
Good error handling with fallback to "Completed"

🔍 Issues & Concerns

CRITICAL: Potential React Key Stability Issue

Location: frontend/src/components/widgets/chat/ChatMessage.tsx:891

The key generation uses calls.length which can create non-unique keys:

id: `${entry.tool ?? TOOL_UNKNOWN_LABEL}-${calls.length}`,

Problem: For duplicate tool calls, this can cause React rendering issues or component state loss.

Fix: Use timeline index instead:

id: `${entry.tool ?? TOOL_UNKNOWN_LABEL}-${i}`,  // Use timeline index i

MINOR: Timeline Indicator Logic

Location: frontend/src/components/widgets/chat/ChatMessage.tsx:1898

The old timeline indicator is hidden when tools are used. Consider adding a clarifying comment:

{/* Show step count indicator only when no tools were used (tool badge replaces it) */}
{effectiveHasTimeline && !hasToolUsage && (

MINOR: Backend Tool Result Extraction Duplication

Location: opencontractserver/llms/agents/pydantic_ai_agents.py:626-700

Duplicated result extraction logic for similarity_search and search_exact_text. Consider extracting to the _extract_tool_result_summary helper function to reduce duplication.

MINOR: Error Handling Fallback Message

Location: opencontractserver/llms/agents/pydantic_ai_agents.py:1300-1315

The fallback "Completed" is fine, but consider "Tool execution completed (details unavailable)" for clarity.

📊 Performance Considerations

Good:

useMemo prevents recalculation on every render
Timeline truncation keeps payloads small
Popover only renders when open

Note: extractToolCalls has O(n²) complexity but acceptable for typical small timelines (<20 entries)

🔒 Security Review

Excellent XSS Protection:

Tool content rendered safely via React text nodes (not dangerouslySetInnerHTML)
Arguments serialized via JSON.stringify
Tool names controlled by backend
Properly documented in code comments

🎯 Recommendations Summary

Must Fix Before Merge

Fix React key generation in extractToolCalls to use timeline index

Nice to Have

Add clarifying comment about timeline indicator mutual exclusivity
Consider consolidating duplicated tool result summary logic
Add frontend test case for >500 character results

Future Enhancements (Not blocking)

Copy-to-clipboard for tool args/results
Show execution time if captured

✅ Overall Assessment

High-quality PR with:

Comprehensive test coverage (23 total test cases)
Proper security considerations
Excellent UX with accessibility
Clean, maintainable code
Proper documentation

Recommendation: Approve with one critical fix (React key stability)

The implementation successfully exposes tool usage in a polished, accessible way. Great work!

🧪 Testing Notes

To verify before merge:

Frontend: yarn test:unit chatMessageUtils.test.ts
Backend: docker compose -f test.yml run django pytest opencontractserver/tests/test_timeline_utils.py -v
TypeScript: yarn tsc --noEmit
Pre-commit: pre-commit run --all-files

Per CLAUDE.md baseline rules, ensure all tests pass and TypeScript compiles.

codecov · 2026-02-12T20:17:36Z

Codecov Report

❌ Patch coverage is 98.41270% with 3 lines in your changes missing coverage. Please review.

Files with missing lines	Patch %	Lines
...encontractserver/llms/agents/pydantic_ai_agents.py	95.38%	3 Missing ⚠️

📢 Thoughts on this report? Let us know!

…cation - Use timeline index for stable React keys in extractToolCalls - Document forward-only pairing assumption in extractToolCalls - Change args type from any to Record<string, unknown> on TimelineEntry - Downgrade DIAGNOSTIC logger.error calls to logger.debug - Store string length before truncation to avoid double conversion - Fix prettier formatting across affected files

claude · 2026-02-13T06:09:49Z

PR Review: Add Tool Usage Visibility to Chat Messages

This PR adds a helpful feature that exposes tool usage information to users through a badge and popover UI. The implementation is comprehensive with good test coverage. Here's my detailed feedback:

✅ Strengths

Code Quality

Excellent test coverage: Both frontend (Vitest unit tests) and backend (Python unit tests) are well-covered
Clean separation of concerns: Tool call extraction logic is separated into testable utility functions
Good documentation: Inline comments explain key decisions (consumed-set approach, XSS safety, etc.)
Type safety: TypeScript interfaces are well-defined with JSDoc comments

Architecture

Smart pairing algorithm: The extractToolCalls function uses a consumed-set approach to correctly pair duplicate tool calls (ChatMessage.tsx:1133-1164)
Performance considerations: Uses useMemo for tool call extraction to avoid re-computation
Accessibility: ARIA attributes for the popover (role="dialog", aria-expanded, etc.) at ChatMessage.tsx:1227-1233
Result truncation: Backend sensibly limits result content to 500 chars to keep payloads lightweight (timeline_utils.py:93-95)

UX

Graceful degradation: Tool badge only shows when tools are actually used
Responsive design: Mobile-specific styles with appropriate breakpoints
Hover debouncing: 200ms delay prevents flickering (ChatMessage.tsx:1206-1208)
Cleanup on unmount: Timeout cleanup prevents memory leaks (ChatMessage.tsx:1187-1194)

🔍 Issues & Suggestions

1. Backend: Changed logger.error to logger.debug ⚠️

Location: pydantic_ai_agents.py lines 711-1040 (throughout)

Issue: The PR changes diagnostic logging from logger.error to logger.debug. While this reduces log noise, these diagnostics appear valuable for troubleshooting agent execution issues.

Recommendation:

Keep as logger.debug for normal operation
Add a comment explaining that these can be temporarily changed to logger.error for debugging if needed
Consider using a structured logging approach with log levels that can be adjusted per-module

2. Frontend: Missing keyboard accessibility for popover close

Location: ChatMessage.tsx ToolUsageIndicator component (lines 1180-1288)

Issue: While the badge has keyboard support (Enter/Space to toggle, Escape to close), there's no visual focus indicator when the popover is open, and no focus management when it opens/closes.

Recommendation:

// Add focus trap or at minimum focus management
useEffect(() => {
  if (isOpen && popoverRef.current) {
    // Focus first interactive element or popover itself
    popoverRef.current.focus();
  }
}, [isOpen]);

3. Frontend: Potential issue with popover positioning on mobile

Location: ChatMessage.tsx:735-754

Issue: The popover uses right: 0 positioning, but on mobile it shifts to right: -1rem. This could cause the popover to overflow the viewport on narrow screens.

Recommendation: Consider adding viewport-aware positioning or a max-width constraint:

max-width: min(440px, calc(100vw - 2rem));

4. Backend: _extract_tool_result_summary exception handling too broad

Location: pydantic_ai_agents.py:123-145

Issue: The bare except Exception catches all exceptions, which could hide bugs.

Recommendation: Catch specific exceptions:

except (AttributeError, KeyError, TypeError) as e:
    logger.debug(
        "Could not extract tool result summary for %s: %s", 
        tool_name, str(e), exc_info=True
    )

5. Frontend: XSS comment could be more precise

Location: ChatMessage.tsx:842-844

Issue: The comment states "Content originates from backend agent tool execution, not user input." However, tool arguments CAN include user input (e.g., search queries).

Recommendation: Clarify the comment:

// XSS safety: Content is rendered as React text nodes (auto-escaped).
// Tool args may contain user input, but React's JSX escaping handles this safely.
// Tool results are LLM/tool outputs and also auto-escaped.

6. Backend: Inconsistent result emission for tools

Location: pydantic_ai_agents.py:633-950

Issue: Only similarity_search, search_exact_text, and ask_document get explicit result summaries. Other tools use the generic _extract_tool_result_summary which may return "Completed" for most tools.

Recommendation: Document this behavior or add result summaries for more tools. For example:

add_annotations_from_exact_strings: "Added X annotations"
`update_document_summary": "Summary updated successfully"

7. Frontend: Popover z-index could conflict

Location: ChatMessage.tsx:739

Issue: z-index: 100 might conflict with other UI elements. The codebase has various z-index values (e.g., LabelMenu uses 2000).

Recommendation: Define z-index constants in the constants file to maintain a z-index hierarchy:

export const Z_INDEX = {
  TOOL_POPOVER: 1000,
  LABEL_MENU: 2000,
  // ... other values
} as const;

📊 Test Coverage Assessment

Frontend Tests ✅

Comprehensive unit tests for extractToolCalls (9 test cases)
Edge cases covered: empty timeline, missing tool names, duplicate tools, pairing logic
Test for formatToolName utility
Missing: Component tests for ToolUsageIndicator (hover behavior, keyboard interaction)

Backend Tests ✅

Excellent coverage for TimelineBuilder (13 test cases)
Covers truncation, pairing, missing metadata, type conversion
Missing: Integration test showing full flow from tool execution → timeline → frontend

🔒 Security Review

XSS Prevention: ✅ Safe

Content rendered via React text nodes (auto-escaped)
JSON.stringify used for displaying args (safe within text nodes)
No use of dangerouslySetInnerHTML

Data Truncation: ✅ Appropriate

500 char limit prevents DoS via massive tool outputs
Truncation message shows original length for transparency

🎯 Performance Considerations

Frontend:

✅ useMemo used for tool call extraction
✅ Cleanup on unmount prevents memory leaks
⚠️ Framer Motion adds ~30KB to bundle (acceptable for enhanced UX)

Backend:

✅ Result truncation keeps WebSocket payloads lightweight
✅ Minimal processing overhead (string slicing + metadata extraction)

📝 Changelog Compliance

✅ Changelog follows "Keep a Changelog" format
✅ Includes file paths and component names
✅ Clear description of user-facing changes

Minor suggestion: Add a "Technical Details" subsection mentioning the pairing algorithm and truncation strategy for future maintainers.

🎨 Code Style & Conventions

✅ Follows project patterns (styled-components, Jotai atoms)
✅ Responsive design with mobile breakpoints
✅ Consistent with existing timeline/source indicators
✅ No dead code or commented-out sections

🚀 Recommendations Before Merge

High Priority

Fix overly broad exception handling in _extract_tool_result_summary
Add focus management for keyboard accessibility in popover
Add max-width constraint for mobile popover positioning

Medium Priority

Clarify XSS safety comment regarding user input in tool args
Consider adding result summaries for more tools (not just search tools)
Define z-index constants to prevent future conflicts

Low Priority

Add component tests for ToolUsageIndicator hover/keyboard behavior
Consider adding integration test for full tool visibility flow
Add note in changelog about consumed-set pairing algorithm

✨ Overall Assessment

Verdict: ✅ Approve with minor revisions

This is a well-implemented feature that significantly improves transparency around agent tool usage. The code quality is high, test coverage is strong, and the UX is thoughtful. The issues identified are mostly minor and can be addressed quickly.

Estimated time to address high-priority items: ~30-60 minutes

Great work on this enhancement! 🎉

…age-k5Q20

…over 10 functional tests covering badge visibility, hover/keyboard interaction, ARIA accessibility, multi-tool display, and pending tool state. 3 screenshot tests capturing the badge, single-tool popover, and multi-tool popover for auto-generated documentation.

claude · 2026-02-16T21:03:51Z

Code Review: Tool Usage Visibility in Chat Messages

I've completed a thorough review of PR #880. This is an excellent feature addition that enhances transparency by exposing agent tool usage to users. The implementation is well-structured with comprehensive tests and documentation. Below are my findings:

✅ Strengths

Architecture & Design

Clean separation of concerns: Frontend display logic (ToolUsageIndicator), backend data capture (TimelineBuilder), and schema definition (timeline_schema.py) are properly separated
Framework-agnostic timeline: The unified timeline approach in TimelineBuilder ensures consistency across different LLM frameworks
Proper state management: Uses Jotai atoms pattern correctly, integrates well with existing message architecture
Accessibility-first: Comprehensive ARIA attributes (role, aria-expanded, aria-haspopup, aria-describedby) and keyboard navigation (Enter/Space/Escape)

Code Quality

DRY principle followed: _extract_tool_result_summary() helper eliminates code duplication across tool handlers (line 123-145 in pydantic_ai_agents.py)
Defensive coding: Handles missing metadata gracefully, includes truncation with character count display
Type safety: Proper TypeScript interfaces (ToolCallInfo, TimelineEntry) and Python TypedDict
Constants usage: Uses TOOL_UNKNOWN_LABEL and MAX_TOOL_RESULT_LENGTH constants instead of magic values

Testing

Comprehensive test coverage:
- Backend: 14 tests in test_timeline_utils.py covering capture, truncation, pairing, edge cases
- Frontend unit: 12 tests in chatMessageUtils.test.ts for extraction/formatting logic
- Component: 10 functional tests + 3 screenshot tests in ToolUsagePopover.ct.tsx
Test quality: Tests verify real user scenarios (hover/keyboard interaction, multiple tools, pending state, ARIA attributes)
Documentation screenshots: Automated screenshot capture integrated into CI workflow

🔍 Issues & Recommendations

1. Memory Leak Already Fixed ✅

The useEffect cleanup for closeTimeoutRef (lines 1188-1194) properly prevents memory leaks. Good catch in the second commit.

2. React Key Stability ✅

Using timeline index in the key is stable for this use case since timeline entries are append-only and never reordered (line 1144 in ChatMessage.tsx).

3. Tool Call Pairing Logic ✅

The consumed-set approach (lines 1133-1163) correctly handles duplicate tool calls. The forward-only search assumption is documented and valid.

4. Performance Consideration: Memoization

The extractToolCalls function runs on every render via useMemo(() => extractToolCalls(timeline), [timeline]) (line 1183). For long timelines with many tool calls, this could be expensive. However:

Current implementation is probably fine for typical use (most messages have <5 tool calls)
The memoization already prevents unnecessary recalculations when timeline has not changed
No action needed unless profiling shows issues

5. XSS Safety ✅

Line 842-844 documentation correctly notes that content is rendered as React text nodes (auto-escaped), not via dangerouslySetInnerHTML. Safe.

6. Backend: Type Coercion Safety ✅

In timeline_utils.py line 83-89, the truncation logic correctly stores length after conversion. The test at line 61-79 in test_timeline_utils.py confirms this works correctly.

7. Logging Levels ✅

The switch from logger.error to logger.debug for diagnostic messages (commit 5) is correct - these are not operational errors, just trace information.

8. Frontend: Styled Component Complexity

ChatMessage.tsx is now 1985 lines with many styled components. While well-organized, consider:

Future refactor: Extract tool-related styled components to ChatMessage.styles.ts or a separate ToolUsagePopover component file
Not blocking: Current organization is acceptable, just a maintainability suggestion for future work

9. Truncation UX ✅

Line 87-89 shows truncated char count: "... (truncated from 600 chars)" - excellent UX decision. User knows how much was cut.

10. Popover Positioning

The popover uses right: 0 positioning (line 738). On mobile or narrow containers, this could clip off-screen. The test at line 391-392 widens the viewport and adds padding, acknowledging this.

Recommendation: Consider using a tooltip library (like Radix UI Popover) that handles auto-positioning based on viewport boundaries. Not blocking - current implementation works for typical use cases.

🔒 Security Review

✅ No IDOR vulnerabilities (tool data comes from backend, tied to message ownership)
✅ XSS prevention: React auto-escapes text nodes
✅ No SQL injection risks (no dynamic queries)
✅ No authentication bypass concerns
✅ Tool result content originates from backend agent execution (trusted source)

📊 Performance Assessment

✅ Timeline truncation (500 chars) keeps payloads lightweight
✅ Memoization prevents unnecessary re-extraction
✅ Popover uses AnimatePresence for smooth mount/unmount
✅ Hover debounce (200ms close timeout) prevents flickering
⚠️ Watch for: Very long timelines (>50 tools) could slow down extraction, but unlikely in practice

📝 Documentation & Changelog

✅ CHANGELOG.md properly documents the feature with file locations
✅ JSDoc comments on key functions (extractToolCalls, formatToolName)
✅ Inline schema documentation in timeline_schema.py
✅ Test comments explain behavior and assumptions
✅ Screenshot tests generate auto-updated documentation images

🧪 Test Coverage Assessment

Backend

✅ Tool result capture with/without content
✅ Truncation at/above/below MAX_TOOL_RESULT_LENGTH
✅ Tool call/result pairing (single, multiple, duplicate tools)
✅ Edge cases (empty result, non-string values, reset)

Frontend Unit

✅ Empty timeline, single tool, multiple tools
✅ Duplicate tool pairing with stable IDs
✅ Missing tool names (TOOL_UNKNOWN_LABEL fallback)
✅ snake_case to Title Case formatting

Frontend Component

✅ Badge visibility conditions (assistant vs user, with/without tools)
✅ Popover interaction (hover, keyboard, escape)
✅ ARIA attributes and accessibility
✅ Multiple tool display with all details
✅ Pending tool state (call without result yet)
✅ Screenshot capture for documentation

Coverage is excellent - all critical paths tested.

✨ Final Verdict

APPROVED ✅

This is a well-engineered feature that follows the project's conventions and best practices. The code is:

Clean and maintainable
Thoroughly tested
Accessible
Secure
Well-documented

Minor Suggestions (Non-Blocking)

Consider extracting tool popover into separate component file in future refactor (maintainability)
Consider viewport-aware positioning library for popovers (UX enhancement)

Both are nice-to-haves and can be addressed in follow-up work if needed.

📋 Checklist

✅ Code quality and best practices
✅ No bugs or issues identified
✅ Performance considerations addressed
✅ Security review passed
✅ Test coverage comprehensive
✅ Documentation complete
✅ Follows CLAUDE.md conventions
✅ CHANGELOG.md updated
✅ No dead code introduced
✅ DRY principle followed
✅ Single Responsibility Principle maintained

Great work! 🎉

…ettle Screenshots were faded because ChatMessage has transparent backgrounds designed for the chat panel's #f8fafc bg and a framer-motion entrance animation (opacity 0→1, 300ms). Added the proper background wrapper and a 500ms settle delay before capture.

claude · 2026-02-16T21:07:30Z

Code Review: Tool Usage Visibility Enhancement

This PR successfully implements tool usage visibility in the chat UI with comprehensive testing and well-structured code. Here's my detailed feedback:

✅ Strengths

1. Excellent Test Coverage

Backend: test_timeline_utils.py (234 lines) provides thorough coverage of tool result capture, truncation, pairing logic, and edge cases
Frontend Unit Tests: chatMessageUtils.test.ts tests the extraction and formatting logic with 12 test cases covering duplicate tools, missing data, and edge cases
Frontend Component Tests: ToolUsagePopover.ct.tsx (480 lines) includes 14 tests covering UI behavior, keyboard navigation, ARIA attributes, and documentation screenshots
All tests follow the project's established patterns (e.g., using --reporter=list for Playwright)

2. Clean Architecture

Separation of concerns: extractToolCalls() and formatToolName() are pure functions tested separately from the UI component
TimelineBuilder centralization: Tool result logic is properly centralized in timeline_utils.py rather than duplicated across adapters
Consumed-set pairing algorithm: The extractToolCalls() function uses a smart algorithm to correctly pair tool_call/tool_result entries when the same tool is called multiple times (ChatMessage.tsx:1133-1164)

3. Security & Accessibility

XSS safety: Proper comment at line 842-844 documenting that tool content is rendered as React text nodes (auto-escaped)
Keyboard navigation: Full keyboard support with Enter/Space to toggle, Escape to close
ARIA attributes: Proper role="button", aria-expanded, aria-haspopup, aria-describedby on badge and role="dialog" on popover
Debounced hover: 200ms delay on mouse leave prevents flickering

4. Performance Considerations

Result truncation: MAX_TOOL_RESULT_LENGTH (500 chars) keeps payloads lightweight while providing useful visibility
useMemo hooks: Tool detection and extraction use useMemo to avoid re-computation
Cleanup: Proper timeout cleanup in useEffect to prevent memory leaks (ChatMessage.tsx:1188-1194)

5. User Experience

Progressive disclosure: Tool usage indicated by badge, details on demand via popover
Visual consistency: Blue theme (#3b82f6) for tool indicators matches existing UI patterns
Responsive design: Mobile-specific adjustments for badge and popover sizing
Framer Motion animations: Smooth 150ms transitions for popover open/close

🔍 Potential Issues & Suggestions

1. Backend: Tool Result Summary Extraction ⚠️

Location: pydantic_ai_agents.py:123-146

The _extract_tool_result_summary() function has a potential issue:

def _extract_tool_result_summary(event: Any, tool_name: str) -> str:
    try:
        result_content = event.result.content  # type: ignore[attr-defined]
        if isinstance(result_content, dict):
            # ask_document returns {"answer": ..., "sources": ..., "timeline": ...}
            summary = result_content.get("answer", "")
            if summary:
                return str(summary)
        elif isinstance(result_content, str) and result_content:
            return result_content
        elif result_content is not None:
            return str(result_content)
    except Exception:
        logger.debug(
            "Could not extract tool result summary for %s", tool_name, exc_info=True
        )
    return "Completed"

Problem: For ask_document, this will extract the full answer text which could be very long (easily exceeding 500 characters). The subsequent truncation in TimelineBuilder (timeline_utils.py:85-89) only truncates at the timeline level, but the metadata passed through ThoughtEvent already contains the full content.

Suggestion: Truncate at source in _extract_tool_result_summary():

def _extract_tool_result_summary(event: Any, tool_name: str) -> str:
    from .timeline_utils import MAX_TOOL_RESULT_LENGTH
    
    try:
        result_content = event.result.content
        summary = ""
        
        if isinstance(result_content, dict):
            summary = result_content.get("answer", "")
        elif isinstance(result_content, str):
            summary = result_content
        elif result_content is not None:
            summary = str(result_content)
        
        if summary:
            # Truncate early to avoid passing large strings through metadata
            if len(summary) > MAX_TOOL_RESULT_LENGTH:
                summary = summary[:MAX_TOOL_RESULT_LENGTH] + "..."
            return summary
    except Exception:
        logger.debug(
            "Could not extract tool result summary for %s", tool_name, exc_info=True
        )
    return "Completed"

2. Frontend: Word Breaking Behavior 🎨

Location: ChatMessage.tsx:208

The message content has word-break: break-all which breaks words at any character, making text harder to read:

word-wrap: break-word;
overflow-wrap: break-word;
word-break: break-all;  // ⚠️ Too aggressive

Suggestion: Use word-break: break-word instead (deprecated but better UX) or remove it entirely and rely on overflow-wrap: break-word:

word-wrap: break-word;
overflow-wrap: break-word;
// word-break removed - overflow-wrap handles long URLs/tokens gracefully

3. Constants Organization 📁

Location: constants.ts (new addition)

The PR adds TOOL_UNKNOWN_LABEL to the frontend constants file, but this is already defined in the backend. Consider:

Document why this duplication is necessary (backend/frontend need separate constants)
OR use a shared constants approach if the project has one

4. Changelog Completeness 📝

Location: CHANGELOG.md

The changelog entry is good but missing:

File locations and line numbers for key changes (per CLAUDE.md guidelines)
Impact statement (e.g., "Users can now see which tools the AI used without expanding the full timeline")

Suggested addition:

### Added
- Tool usage visibility in chat UI (frontend/src/components/widgets/chat/ChatMessage.tsx:1180-1288)
  - Blue badge showing "X tools used" appears in assistant message header when tools are executed
  - Hover popover displays each tool call's name (formatted), input arguments (JSON), and output result
  - Keyboard accessible (Enter/Space to toggle, Escape to close) with full ARIA attributes
  - Impact: Users can quickly see AI's tool usage without expanding the full timeline, improving transparency

### Changed
- Timeline schema updated to include `result` field in tool_result entries (opencontractserver/llms/agents/timeline_schema.py:52)
- Timeline builder now captures tool result content from metadata, truncated to 500 chars (opencontractserver/llms/agents/timeline_utils.py:77-92)
- Agent handlers emit tool result summaries for all tools (opencontractserver/llms/agents/pydantic_ai_agents.py:642-657, 686-702, 807-813)

5. Type Safety 🔒

Location: ChatMessage.tsx:53

The TimelineEntry.result field is string | undefined, but the component doesn't validate this at runtime before rendering. Consider adding a type guard or null check where it's used.

Current:

{call.result && (
  <ToolCallSection>
    <ToolCallResultBlock>{call.result}</ToolCallResultBlock>
  </ToolCallSection>
)}

This is actually correct! The call.result && check properly handles undefined. ✅

🧪 Testing Recommendations

Run Backend Tests in Parallel:

docker compose -f test.yml run django pytest opencontractserver/tests/test_timeline_utils.py -n 4 --dist loadscope

Run Frontend Component Tests:

cd frontend
yarn test:ct --reporter=list -g "Tool Usage Popover"

Manual Testing Scenarios:

Multiple identical tool calls: Verify pairing is correct when similarity_search is called 3 times
Long tool results: Test with results exceeding 500 characters to verify truncation indicator appears
Mobile responsiveness: Test popover positioning on small screens (should not overflow viewport)
Screen reader: Test with NVDA/JAWS to verify ARIA announcements

📊 Performance Impact

Bundle size: ~15KB added (styled-components, helper functions, Framer Motion already present)
Runtime overhead: Negligible - extractToolCalls is O(n²) worst case but n is typically <10
Network payload: Timeline entries grow by ~50-500 chars per tool call (acceptable)

🎯 Summary

Overall Assessment: APPROVE with minor suggestions

This is a well-executed feature with:

✅ Comprehensive test coverage (backend + frontend unit + component)
✅ Clean architecture following project patterns
✅ Excellent accessibility and security practices
✅ Good performance characteristics

Recommended Actions:

High priority: Add truncation to _extract_tool_result_summary() to avoid passing large strings through metadata
Medium priority: Change word-break: break-all to improve text readability
Low priority: Enhance changelog with file locations and impact statement
Low priority: Document TOOL_UNKNOWN_LABEL duplication rationale

Blocking Issues: None

The code is production-ready. The suggestions above are optimizations that can be addressed in follow-up commits if needed.

Great work on this feature! The popover design is clean and the implementation is robust. 🚀

Expand step-7 with screenshots showing the tool usage badge, single-tool popover, and multi-tool popover. Includes a reference table of available agent tools.

…hangelog - Truncate tool result summaries at source in _extract_tool_result_summary() using MAX_TOOL_RESULT_LENGTH (500 chars) to avoid passing large strings through ThoughtEvent metadata - Replace word-break: break-all with word-break: break-word for more natural text wrapping in message content - Fix CI screenshot workflow: backup freshly generated screenshots before rebase, abort failed rebase and fall back to merge, then force-restore screenshots regardless of conflict state. Also add --reporter=list flag to prevent test hangs in CI. - Enhance changelog with file locations, line numbers, and impact statements

…age-k5Q20

claude · 2026-02-16T23:13:35Z

PR Review: Add Tool Usage Visibility to Chat Messages

Overview

This PR successfully implements tool usage transparency in the chat UI by adding a visual indicator badge and detailed popover showing tool call inputs and outputs. The implementation follows the project's architectural patterns and includes comprehensive testing.

✅ Strengths

1. Excellent Architecture & Code Organization

Clean separation of concerns: Timeline data flows from backend → frontend in a well-structured pipeline
Type safety: Proper TypeScript interfaces (TimelineEntry, ToolCallInfo) with schema validation
Reusable utilities: Helper functions (extractToolCalls, formatToolName) are properly tested and documented
DRY principle: Centralized timeline building logic in TimelineBuilder prevents duplication across adapters

2. Strong Testing Coverage

Backend: Comprehensive unit tests in test_timeline_utils.py covering truncation, pairing, edge cases (233 lines)
Frontend:
- Unit tests for utility functions in chatMessageUtils.test.ts (151 lines)
- Component tests in ToolUsagePopover.ct.tsx with visual regression screenshots (480 lines)
Edge case coverage: Empty timelines, missing results, duplicate tool calls, truncation boundaries

3. Performance & UX Considerations

Truncation at source: Tool results truncated to 500 chars in backend before transmission (prevents payload bloat)
Smart pairing algorithm: Consumed-set approach prevents mispairing when same tool called multiple times
Smooth animations: Framer Motion with proper cleanup (useEffect unmount cleanup prevents memory leaks)
Accessibility: Full ARIA attributes, keyboard navigation (Enter/Space/Escape), proper role attributes
Mobile responsive: Separate breakpoints for tablet/mobile with adjusted sizing

4. Documentation & Maintainability

Inline comments: Clear explanations of complex logic (pairing algorithm, XSS safety notes)
CHANGELOG updated: Detailed entries with file locations and line numbers
Automated screenshots: Uses new docScreenshot utility for visual documentation
Constants file usage: TOOL_UNKNOWN_LABEL properly defined in constants (avoiding magic strings)

🔍 Areas for Improvement

1. Security Considerations

XSS Risk in Tool Result Display (Medium Priority)

Location: frontend/src/components/widgets/chat/ChatMessage.tsx:844-857
Issue: While the comment claims content is "auto-escaped", tool results are rendered as React text nodes which is correct. However, the ToolCallCodeBlock uses white-space: pre-wrap which could preserve malicious formatting.
Recommendation:
- Add explicit sanitization for tool results that might contain user-controlled content
- Consider using a dedicated sanitization library like DOMPurify for any content that might include special characters
- Add a test case with malicious input patterns

// Suggested addition in ChatMessage.tsx
import DOMPurify from 'dompurify';

// In ToolCallResultBlock rendering:
<ToolCallResultBlock>
  {DOMPurify.sanitize(call.result, { ALLOWED_TAGS: [] })}
</ToolCallResultBlock>

2. Code Quality Issues

Potential Race Condition in Timeout Management (Low Priority)

Location: ChatMessage.tsx:1206-1207
Issue: The closeTimeoutRef is checked and cleared but there's a race condition if the component unmounts between check and clear
Recommendation:

const handleMouseLeave = () => {
  if (closeTimeoutRef.current) {
    clearTimeout(closeTimeoutRef.current);
  }
  closeTimeoutRef.current = setTimeout(() => setIsOpen(false), 200);
};

Missing Null Check in Backend (Low Priority)

Location: opencontractserver/llms/agents/pydantic_ai_agents.py:136
Issue: event.result.content access without null check could throw AttributeError
Recommendation: Add defensive check:

if not hasattr(event, 'result') or not hasattr(event.result, 'content'):
    return "Completed"
result_content = event.result.content

3. Testing Gaps

Missing Test Cases:

Popover positioning edge cases: What happens when popover would overflow viewport?
Concurrent tool calls: Timeline with interleaved tool calls (call A → call B → result B → result A)
Very long tool names: How does formatToolName handle extremely long snake_case strings?
Unicode in tool results: Test emoji, RTL text, special characters
Backend: No tests for _extract_tool_result_summary error handling paths

Recommendation: Add test cases:

// Frontend test
it("handles extremely long tool names gracefully", () => {
  const longName = "very_long_tool_name_with_many_segments_that_exceeds_normal_length";
  const result = formatToolName(longName);
  expect(result.length).toBeLessThan(100); // Add length constraint
});

4. Performance Considerations

Unnecessary Re-renders (Low Priority)

Location: ChatMessage.tsx:1182
Issue: useMemo for extractToolCalls is good, but hasToolUsage is also memoized separately
Recommendation: Combine into single memo:

const toolCallInfo = useMemo(() => {
  const calls = extractToolCalls(timeline);
  return { calls, hasToolUsage: calls.length > 0 };
}, [timeline]);

Timeline Builder N+1 Pattern (Low Priority)

Location: timeline_utils.py:77-92
Issue: String concatenation in tight loop could be inefficient for very long results
Recommendation: Already handled by truncation, but consider using string builder pattern for future

5. Documentation Issues

Incomplete ARIA Labels

Location: ChatMessage.tsx:1243-1245
Issue: aria-describedby only set when popover is open, but the badge should always have a description
Recommendation:

<ToolBadge
  aria-label={`${toolCalls.length} ${toolCalls.length === 1 ? "tool" : "tools"} used. Press Enter to view details.`}
  // ... other props
/>

Missing JSDoc Comments

Helper functions like _to_source_node and _extract_tool_result_summary lack docstrings
Frontend utility functions lack JSDoc

🎯 Critical Issues

None Found ✅

The PR is well-implemented with no blocking issues. The suggestions above are enhancements.

📊 Code Metrics

Lines Added: 1,452
Lines Deleted: 69
Files Changed: 23
Test Coverage: Strong (3 test files, 864+ lines of tests)
Documentation: CHANGELOG updated, inline comments present

🚀 Recommendations for Merge

Before Merge (Optional but Recommended):

Add XSS sanitization for tool results
Fix timeout race condition in handleMouseLeave
Add null check in _extract_tool_result_summary
Add missing ARIA label enhancement

Follow-up PRs (Non-blocking):

Add unicode and edge case test coverage
Add JSDoc comments to utility functions
Consider performance optimization for very large timelines (100+ entries)

💡 Suggestions

Future Enhancements:

Tool execution time: Show duration for each tool call
Failed tool calls: Visual indicator when tool execution fails
Collapsible tool groups: When 5+ tools used, group similar tools
Copy functionality: Button to copy tool inputs/outputs for debugging
Timeline export: Download full timeline as JSON for analysis

Consistency with CLAUDE.md:

✅ No Claude/Claude Code credits in commits
✅ Tests follow parallel testing patterns
✅ Uses constants file for magic values
✅ Follows DRY principle
✅ Single Responsibility Principle maintained
✅ CHANGELOG updated with proper format

Final Verdict: APPROVE with Minor Suggestions ✅

This is a high-quality PR that adds valuable functionality with excellent test coverage and documentation. The implementation follows project conventions and architectural patterns. The suggested improvements are minor and non-blocking.

Estimated Risk: Low
Merge Recommendation: Approve (address security suggestions before or after merge based on urgency)

claude · 2026-02-16T23:30:59Z

Code Review - PR #880: Tool Usage Visibility

I've completed my review of this PR. Overall, this is excellent work with strong implementation quality, comprehensive testing, and good documentation. Here's my detailed feedback:

✅ Strengths

Architecture & Design

Clean separation of concerns: Timeline building logic is properly centralized in TimelineBuilder rather than duplicated across adapters
Framework-agnostic approach: The TimelineEntry schema and UnifiedStreamEvent pattern ensure this will work consistently across different LLM frameworks
Progressive enhancement: The feature gracefully degrades when tool results aren't available (shows tool name only)
Accessibility-first: Full ARIA attributes, keyboard navigation (Enter/Space/Escape), and proper roles

Code Quality

DRY principle: The _extract_tool_result_summary() helper eliminates 4x duplicated result extraction logic (lines 123-155 in pydantic_ai_agents.py)
Defensive programming: Proper handling of missing/malformed data, truncation at source to prevent bloat
Type safety: Uses Record<string, unknown> instead of any for args, proper TypeScript types throughout
Memory leak prevention: useEffect cleanup for timeout refs (ChatMessage.tsx:1187-1193)

Testing

Comprehensive coverage:
- 14 backend tests covering pairing, truncation, edge cases
- 12 frontend unit tests for extraction logic
- 10 functional Playwright tests covering interaction patterns
- 3 documentation screenshot tests
Test quality: Tests are well-named, focused, and cover both happy paths and edge cases
Proper use of --reporter=list in CI workflow to prevent hanging (screenshots.yml:50)

Documentation

Excellent changelog: Includes file paths, line numbers, and impact statements (CLAUDE.md compliant)
Auto-screenshot system: Smart use of CI to keep docs fresh
JSDoc comments: Good inline documentation on helper functions

🔍 Issues Found

Critical Issues

None found.

Code Quality Issues

1. Inconsistent truncation location (Minor)

Location: pydantic_ai_agents.py:147-148 and timeline_utils.py:83-89

The PR description claims truncation happens "at source" in _extract_tool_result_summary(), but actually:

_extract_tool_result_summary() truncates to 500 chars (line 147-148)
TimelineBuilder.add() ALSO truncates to 500 chars (line 83-89)

Impact: Double truncation logic creates maintenance burden. If MAX_TOOL_RESULT_LENGTH changes, both need updating.

Recommendation: Remove truncation from TimelineBuilder.add() since it's already done at source. Update line 81 comment to note that truncation is already applied upstream.

2. Pairing algorithm could be more efficient (Minor)

Location: ChatMessage.tsx:1132-1163

The current O(n²) nested loop works fine for typical timelines (5-10 entries), but could be O(n) with a single pass using a stack.

Current approach:

for (let i = 0; i < timeline.length; i++) {
  if (entry.type !== "tool_call") continue;
  for (let j = i + 1; j < timeline.length; j++) {  // O(n²)
    if (timeline[j].type === "tool_result" && timeline[j].tool === entry.tool) {
      // pair found
    }
  }
}

Impact: Negligible for real-world usage (timelines rarely exceed 20 entries), but worth noting.

Recommendation: Keep current implementation for clarity, but add a comment noting the complexity tradeoff.

3. word-break CSS property (Very Minor)

Location: ChatMessage.tsx:207

Changed from break-all to break-word. While break-word provides better UX, it's deprecated in favor of overflow-wrap: break-word.

Recommendation: Use overflow-wrap: break-word instead (better browser support going forward).

💡 Suggestions

Performance Optimizations

1. Memoize tool call extraction

The extractToolCalls function is already memoized via useMemo (line 1182), which is good. However, the memoization key is the entire timeline array, which creates a new reference on every update.

Suggestion: Consider memoizing based on timeline.length or a hash of tool call entries only, though the current approach is probably fine for typical use cases.

2. Debounced hover close timeout

The 200ms delay (line 1206) is reasonable, but consider making it a constant at the file level for easier tuning.

Accessibility Enhancements

1. Popover focus management

When the popover opens via keyboard (Enter/Space), focus should move into the popover for screen reader navigation. Currently focus stays on the badge.

Suggestion: Add autoFocus to the first focusable element in the popover, or use a focus trap pattern.

2. Live region for tool results

When streaming tool results in real-time, screen reader users don't know when results arrive. Consider adding aria-live="polite" to the result blocks.

Security

1. XSS prevention documentation is excellent

The comment on lines 841-843 is perfect:

// XSS safety note: Tool result and argument content is rendered as React text
// nodes (not via dangerouslySetInnerHTML), so HTML entities are auto-escaped.
// Content originates from backend agent tool execution, not user input.

No issues here - just wanted to highlight this as a best practice example.

📋 Minor Observations

Frontend

Magic numbers: The 500ms animation settle time (ToolUsagePopover.ct.tsx:392) could be a named constant
Truncation UX: The truncation suffix includes original char count, which is great UX (timeline_utils.py:88)
Stable React keys: Good use of timeline index for stable keys (ChatMessage.tsx:1143)

Backend

Logging levels: Good use of logger.debug for diagnostic logs (not logger.error as mentioned in the changelog)
Type coercion: str() fallback for non-string tool results is defensive (timeline_utils.py:86)
Forward-only pairing: The assumption that tool_result always follows tool_call is documented well (ChatMessage.tsx:1129-1130)

CI/Screenshots

Rebase conflict handling: The workflow has robust fallback logic (screenshots.yml:88-106), though the complexity suggests this might be a common issue
Viewport sizes: Good use of larger viewports (1440x720, 1536x864) for readability

🎯 Test Coverage

What's tested:

✅ Tool call/result pairing (including duplicates)
✅ Truncation logic
✅ Missing tool names (TOOL_UNKNOWN_LABEL fallback)
✅ Keyboard navigation
✅ ARIA attributes
✅ Hover interactions
✅ Multiple tool calls
✅ Pending tool state (no result yet)

What could be tested:

⚠️ Tool result with HTML/script tags - verify XSS prevention in practice
⚠️ Very long tool names (>50 chars) - does the badge truncate/wrap properly?
⚠️ Popover positioning edge cases - what happens when badge is at viewport edge?

None of these are critical, but worth considering for future test additions.

📊 Performance Considerations

Positive:

✅ Truncation at source prevents large payloads in WebSocket
✅ Memoization prevents unnecessary re-extraction
✅ Conditional rendering (only show popover when open)

Potential concerns:

⚠️ The timeline is stored in ChatMessage.data['timeline'] as JSON - for conversations with 100+ messages, this could add up. Consider monitoring DB size impact.
⚠️ The popover uses Framer Motion for animations, which is ~50KB gzipped. Since this is already in use elsewhere in the app, no issue.

🔐 Security Assessment

Threat Model:

XSS via tool results: ✅ Mitigated (React text nodes auto-escape)
XSS via tool names: ✅ Mitigated (formatToolName only does case transformation)
Content injection: ✅ No dangerouslySetInnerHTML usage
Data leakage: ✅ Tool results are scoped to message permissions (already verified in corpus/document permissions)

No security concerns found.

🏁 Conclusion

This PR is ready to merge pending minor fixes:

Required before merge:

Remove duplicate truncation logic from TimelineBuilder.add() OR _extract_tool_result_summary()
Update word-break: break-word to overflow-wrap: break-word

Recommended (can be follow-up):

Add focus management to popover for screen readers
Extract magic numbers (200ms, 500ms) to named constants
Add test for HTML in tool results to verify XSS prevention

Nice to have:

Consider aria-live for streaming results
Add viewport edge case tests for popover positioning

🎖️ Overall Assessment

Code Quality: ⭐⭐⭐⭐⭐ (5/5)
Test Coverage: ⭐⭐⭐⭐⭐ (5/5)
Documentation: ⭐⭐⭐⭐⭐ (5/5)
Security: ⭐⭐⭐⭐⭐ (5/5)
Performance: ⭐⭐⭐⭐ (4/5)

Total: ⭐⭐⭐⭐⭐ Excellent work!

This feature significantly improves agent transparency and user trust. The implementation follows all project conventions, has comprehensive tests, and demonstrates strong engineering practices.

cc: @JSv4

Fix prettier formatting in tool usage frontend files

0ae6255

https://claude.ai/code/session_01GHGe5Ye7Ag9uDrw31mSBXc

Merge remote-tracking branch 'origin/main' into claude/expose-tool-us…

75d4590

…age-k5Q20 # Conflicts: # CHANGELOG.md

JSv4 added 2 commits February 16, 2026 13:50

Merge remote-tracking branch 'origin/main' into claude/expose-tool-us…

4cb75d1

…age-k5Q20

github-actions Bot and others added 6 commits February 16, 2026 21:11

Update auto-generated documentation screenshots

5273bda

Add tool usage documentation to corpus querying walkthrough

8cc5a69

Expand step-7 with screenshots showing the tool usage badge, single-tool popover, and multi-tool popover. Includes a reference table of available agent tools.

Update auto-generated documentation screenshots

773802e

Replace static tool table with link to core_tools.py source

e05598f

Merge remote-tracking branch 'origin/main' into claude/expose-tool-us…

00ee950

…age-k5Q20

github-actions Bot and others added 2 commits February 16, 2026 23:21

Update auto-generated documentation screenshots

9099766

Expand screenshot viewports 20% for better readability

0873e63

Update auto-generated documentation screenshots

5177349

JSv4 merged commit 8a743b7 into main Feb 16, 2026
4 checks passed

JSv4 deleted the claude/expose-tool-usage-k5Q20 branch February 16, 2026 23:51

Uh oh!

Conversation

JSv4 commented Feb 11, 2026

Summary

Key Changes

Frontend (ChatMessage Component)

Backend (Timeline & Agent Updates)

Data Model Updates

Implementation Details

Uh oh!

claude Bot commented Feb 11, 2026

Code Review: PR #880 - Add tool usage visibility to chat messages

Overall Assessment

✅ Strengths

🔴 Critical Issues

1. Missing Cleanup in ToolUsageIndicator (frontend/src/components/widgets/chat/ChatMessage.tsx:1158)

2. Tool Call Pairing Logic May Fail (frontend/src/components/widgets/chat/ChatMessage.tsx:1119-1142)

3. Potential XSS Risk (frontend/src/components/widgets/chat/ChatMessage.tsx:1095)

⚠️ Significant Issues

4. No Test Coverage

5. Accessibility Issues

6. Missing Error Handling (opencontractserver/llms/agents/pydantic_ai_agents.py:772-804)

📝 Code Quality Issues

7. Inconsistent Naming (opencontractserver/llms/agents/pydantic_ai_agents.py)

8. Duplicate Logic

9. Magic String (frontend/src/components/widgets/chat/ChatMessage.tsx:1125)

10. Type Safety Issue

🔍 Performance Considerations

11. Unnecessary Re-renders

12. Missing Index Keys

🔒 Security Concerns

13. No Rate Limiting on Popover Interactions

14. Result Truncation Edge Case

📋 Missing Documentation

15. No JSDoc Comments

16. Timeline Schema Documentation

🎯 Recommendations

High Priority (Must Fix Before Merge):

Medium Priority (Should Fix):

Low Priority (Nice to Have):

🧪 Testing Recommendations

📊 Impact Assessment

Verdict

Uh oh!

claude Bot commented Feb 11, 2026

PR Review: Add tool usage visibility to chat messages with popover details

Summary

Code Quality: Excellent ✅

Strengths

Uh oh!

claude Bot commented Feb 11, 2026

Best Practices Adherence ✅

Uh oh!

claude Bot commented Feb 11, 2026

Areas for Consideration

Uh oh!

claude Bot commented Feb 11, 2026

Performance Considerations ✅

Security Concerns ✅

Test Coverage ✅

Uh oh!

claude Bot commented Feb 11, 2026

Bugs or Issues

Minor Nit

Overall Assessment

Uh oh!

claude Bot commented Feb 11, 2026

PR Review: Tool Usage Visibility in Chat UI

✅ Strengths

1. Comprehensive Test Coverage

2. Security and XSS Prevention

3. Code Quality

4. Accessibility

5. Performance Considerations

6. Changelog and Documentation

🔍 Issues and Recommendations

Critical: Potential Pairing Bug in extractToolCalls

Minor: Inconsistent Error Logging

Code Style: Trailing Commas

🔒 Security Review

codecov Bot commented Feb 12, 2026 •

edited

Loading