# Result: ag-t0k4em1g

## PR
https://github.com/2witstudios/PageSpace/pull/727 (open)

## Summary
Comprehensive code review of PR #708: "Add proactive context window management to prevent AI API errors"

## Changes Reviewed
- `apps/web/src/app/api/ai/chat/route.ts` — Context truncation logic in POST handler + error handling
- `apps/web/src/components/layout/right-sidebar/ai-assistant/SidebarChatTab.tsx` — Centralized error messaging
- `apps/web/src/lib/ai/shared/error-messages.ts` — New `isContextLengthError()` + `getAIErrorMessage()` ordering fix
- `packages/lib/src/monitoring/ai-context-calculator.ts` — CJK-aware tokenizer + OpenRouter model registry

---

# 🔬 **COMPREHENSIVE CODE REVIEW: PR #708 — Proactive Context Window Management**

## **PR Scope Analysis**

**Stated Goals:**
1. Proactive context window management to prevent AI API token limit errors
2. Enhanced model-specific context window detection for OpenRouter
3. Context length error detection and user-friendly error messaging
4. Centralized error messaging in the UI

**Files Changed:** 4 files, +137 / -26 lines

---

## **1. Code Structure & Organization** ✅

### **Good Decisions**

- **Separation of concerns**: Token estimation stays in `@pagespace/lib`, error classification stays in `apps/web/src/lib/ai/shared`, and orchestration stays in the route handler
- **Reuse of existing infrastructure**: Builds on existing `ai-context-calculator.ts` functions rather than duplicating
- **UI centralization**: Moving inline error detection from `SidebarChatTab.tsx` to `getAIErrorMessage()` is a clear DRY improvement

### **Concern: Route handler complexity**

The `chat/route.ts` is already a very large file. Adding 35 more lines of truncation logic adds to its cognitive weight. This is acceptable for now, but as the route grows, the truncation+budgeting block should eventually be extracted into a utility function (e.g., `truncateConversationForModel()`).
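
A minimal sketch of what that extraction could look like; the name `truncateConversationForModel` comes from the suggestion above, and the options shape is an assumption for illustration, not the PR's API:

```typescript
// Hypothetical helper — names and shapes are assumptions, not the PR's code.
interface TruncationResult<M> {
  includedMessages: M[];
  wasTruncated: boolean;
}

function truncateConversationForModel<M>(
  messages: M[],
  opts: {
    contextWindow: number;          // from getContextWindowSize()
    systemPromptTokens: number;     // estimated after the prompt is built
    toolTokens: number;             // from estimateToolDefinitionTokens()
    estimateMessageTokens: (message: M) => number;
    outputHeadroomRatio?: number;   // share of the window reserved for output
  }
): TruncationResult<M> {
  const headroom = opts.outputHeadroomRatio ?? 0.25;
  const inputBudget = Math.floor(opts.contextWindow * (1 - headroom));
  let remaining = inputBudget - opts.systemPromptTokens - opts.toolTokens;

  // Walk newest-to-oldest so the most recent messages survive truncation.
  const included: M[] = [];
  for (let i = messages.length - 1; i >= 0; i--) {
    const cost = opts.estimateMessageTokens(messages[i]);
    if (cost > remaining) break;
    remaining -= cost;
    included.unshift(messages[i]);
  }

  return { includedMessages: included, wasTruncated: included.length < messages.length };
}
```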

---

## **2. JavaScript/TypeScript Standards** ⚠️

### **2.1. Type Safety Issue — `as` casts**

```typescript
// PR line in route.ts
const modelMessages = convertToModelMessages(includedMessages as UIMessage[], {
  tools: filteredTools
});
```

**Issue**: The `ai-context-calculator.ts` defines its own `UIMessage` interface (a minimal subset), while the route imports `UIMessage` from the `ai` package (Vercel AI SDK). The `as UIMessage[]` cast silently coerces between two structurally different types. If the Vercel SDK's `UIMessage` adds required fields in a future version, this cast would suppress TypeScript's warning.

**Recommendation**: Instead of casting, update `determineMessagesToInclude` to accept a generic type parameter or accept the Vercel SDK's `UIMessage` type directly. Alternatively, make the `ai-context-calculator.ts` interface accept the Vercel SDK `UIMessage` as a compatible supertype.
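
A sketch of the generic alternative, assuming the calculator only reads the fields it estimates from; the constraint below is illustrative, not the module's actual interface:

```typescript
// Minimal structural constraint covering what the estimator reads
// (assumed — the real calculator interface may read more fields).
interface TokenizableMessage {
  role: string;
  parts?: Array<{ type: string; text?: string }>;
}

// The caller's concrete type (e.g. the Vercel AI SDK's UIMessage) satisfies
// the constraint structurally and flows through unchanged, so the call site
// no longer needs `as UIMessage[]`.
declare function determineMessagesToInclude<M extends TokenizableMessage>(
  messages: M[],
  budgetTokens: number
): M[];
```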

```typescript
// Also in route.ts
const toolTokens = estimateToolDefinitionTokens(filteredTools as Record<string, unknown>);
```

This cast is documented with a comment explaining why, which is acceptable, but it's another safety bypass.

**Severity**: Medium — works today, fragile going forward.

### **2.2. Magic Numbers**

```typescript
const inputBudget = Math.floor(contextWindow * 0.75);
```

The `0.75` (25% headroom) is a critical tuning parameter. It should be a named constant:

```typescript
const OUTPUT_TOKEN_HEADROOM = 0.25;
const inputBudget = Math.floor(contextWindow * (1 - OUTPUT_TOKEN_HEADROOM));
```

**Severity**: Low — functional but hurts readability/maintainability.

### **2.3. Comment Quality** ✅

Comments are generally good. The inline comments explaining *why* truncation happens after system prompt construction are valuable. The docblocks on new functions are minimal and appropriate per `javascript.mdc`.

---

## **3. Token Estimation Accuracy** ⚠️

### **3.1. CJK-Aware Tokenizer — Good Addition, Flawed Heuristic**

```typescript
const nonAsciiCount = (text.match(/[^\x00-\x7F]/g) || []).length;
const nonAsciiRatio = nonAsciiCount / text.length;
const charsPerToken = nonAsciiRatio > 0.2 ? 2 : 4;
```

**Good**: Recognizing that CJK text tokenizes differently is a thoughtful improvement.

**Issues**:
1. **Regex on every call**: `text.match(/[^\x00-\x7F]/g)` creates a new array of all non-ASCII characters. For very large messages (e.g., pasted documents), this allocates a potentially large array just to count it. A loop or `replace()` + length difference would be more memory-efficient (see the sketch below).
2. **Binary threshold at 20%**: Jumping from 4 chars/token to 2 chars/token at exactly 20% non-ASCII is a cliff. Mixed content (e.g., English text with some CJK) would get inconsistent estimates depending on which side of 20% it falls on. A linear interpolation might be more accurate, though the simplicity here is acceptable given this is already an estimation.
3. **Emoji inflation**: Emoji characters are non-ASCII but typically tokenize to 1-3 tokens each, similar to words. The 2 chars/token ratio over-estimates for emoji-heavy text, but this is conservative (safe direction).

**Severity**: Low-Medium — the estimation is conservative, which is the right direction for preventing context overflow. But the regex allocation is worth optimizing for large conversations.
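
A sketch of the allocation-free alternative mentioned in point 1; `estimateTokensSketch` is a hypothetical name, not the PR's function:

```typescript
// Count non-ASCII code units with a loop instead of materializing a match array.
function nonAsciiRatio(text: string): number {
  if (text.length === 0) return 0;
  let count = 0;
  for (let i = 0; i < text.length; i++) {
    if (text.charCodeAt(i) > 0x7f) count++;
  }
  return count / text.length;
}

// Same binary heuristic as the PR, minus the per-call array allocation.
function estimateTokensSketch(text: string): number {
  const charsPerToken = nonAsciiRatio(text) > 0.2 ? 2 : 4;
  return Math.ceil(text.length / charsPerToken);
}
```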

### **3.2. Docblock Updated for CJK Behavior** ✅

The docblock for `estimateTokens()` needed updating because the implementation now adapts to CJK content. The original read:
```text
* Estimate tokens in a text string
* Uses 4 characters per token as a rough estimate
* This is conservative - actual token count may be slightly lower
```

But the implementation now uses 2 or 4 characters per token depending on content, and the PR diff shows the docblock was updated accordingly to describe the new CJK-aware behavior.


---

## **4. Context Window Model Registry** ⚠️

### **4.1. OpenRouter Model Detection — Comprehensive but Brittle**

The new OpenRouter block adds model-specific context limits for Claude, Gemini, GPT, DeepSeek, Qwen, Llama, and Mistral. This is a clear improvement over the previous generic 200k fallback.

**Issues**:

1. **Hardcoded model-version strings**: Values like `gpt-5.2`, `gpt-5.1`, `deepseek-r1`, `qwen-2.5` will go stale as new models are released. There's no mechanism to refresh these — they need manual code updates.

2. **Duplicate logic**: The GPT model context sizes in the OpenRouter block duplicate the logic already present in the `providerLower === 'openai'` block above. If OpenAI context limits change, both blocks need updating. This violates DRY.

```typescript
// Already exists for providerLower === 'openai':
if (modelLower.includes('gpt-5.2')) { ... }

// Duplicated for providerLower === 'openrouter':
if (modelLower.includes('gpt-5.2')) { ... }
```

**Recommendation**: Extract model-family matching into shared helper functions (see the sketch at the end of this section), or have the OpenRouter branch fall through to existing provider blocks when a known model family is detected.

3. **Order-dependent matching**: `gpt-5` matches before checking for `gpt-5.1` or `gpt-5.2` only because those checks come first. This is correct but fragile — reordering the if-chain would break it. Consider using explicit version parsing or more specific patterns.

4. **Gemini via OpenRouter is inaccurate**: The PR uses `1_000_000` for all Gemini 2.x models via OpenRouter, but the direct Google provider block above correctly distinguishes:
- `gemini-2.5-pro` → 2,000,000
- `gemini-2.5-flash` → 1,000,000

The OpenRouter block flattens all Gemini 2.x to 1M, which underestimates Gemini Pro's context window.

**Severity**: Medium — functional today, maintenance burden going forward.
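
One way to address both the duplication (point 2) and the ordering fragility (point 3) is a shared, ordered lookup table; a sketch under the assumption that context sizes can be keyed by model family, with the GPT values marked as placeholders:

```typescript
// Shared most-specific-first table consulted by both the openai and
// openrouter branches. GPT values are placeholders; the Gemini values
// mirror the direct Google provider block discussed in point 4.
const MODEL_CONTEXT_WINDOWS: Array<{ pattern: RegExp; tokens: number }> = [
  { pattern: /gpt-5\.2/, tokens: 400_000 },          // placeholder value
  { pattern: /gpt-5\.1/, tokens: 400_000 },          // placeholder value
  { pattern: /gpt-5/, tokens: 400_000 },             // placeholder value
  { pattern: /gemini-2\.5-pro/, tokens: 2_000_000 },
  { pattern: /gemini-2\./, tokens: 1_000_000 },
];

function lookupContextWindow(model: string, fallback: number): number {
  const modelLower = model.toLowerCase();
  // First match wins; keeping the table ordered most-specific-first puts the
  // ordering concern in one reviewed place instead of scattered if-chains.
  const hit = MODEL_CONTEXT_WINDOWS.find(({ pattern }) => pattern.test(modelLower));
  return hit ? hit.tokens : fallback;
}
```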

---

## **5. Error Handling** ⚠️

### **5.1. `isContextLengthError()` — False Positive Risk**

```typescript
errorMessage.includes('413') ||
```

Checking for the string `'413'` anywhere in an error message is too broad. A message like "Processed 413 items" or a request ID containing "413" would trigger a false positive. This should check for HTTP status code patterns more specifically:

```typescript
// Better: require an HTTP-status context around the number
/\b(status|http|code)[:\s]+413\b/i.test(msg)
```

Similarly, the `(msg.includes('maximum') && msg.includes('tokens'))` check could match messages like "Maximum number of tokens used for billing" which isn't a context length error.

**Severity**: Low-Medium — false positives would show a misleading error message but wouldn't break functionality.

### **5.2. Error Response Structure — Inconsistent with Existing Pattern**

The new 413 response:
```typescript
return NextResponse.json(
  {
    error: 'context_length_exceeded',
    message: wasTruncated ? '...' : '...',
    details: 'context_length_exceeded',
  },
  { status: 413 }
);
```

The existing 500 response:
```typescript
return NextResponse.json({
  error: 'Failed to process chat request. Please try again.'
}, { status: 500 });
```

These have different shapes. The 413 response has `error` (code), `message` (human-readable), and `details`, while the 500 has only `error` (human-readable). The client-side code in `SidebarChatTab.tsx` reads `error.message`, which would come from the Error object, not the response body.

**Question**: Does the Vercel AI SDK's `useChat` hook parse the JSON response body and set it as the Error message? If not, `getAIErrorMessage(error.message)` may never see `context_length_exceeded` because the Error's `.message` may be a generic HTTP error string, not the response body's `message` field.

**Severity**: Medium — this could mean the 413 error handling in the UI never triggers. Needs verification of how the Vercel AI SDK surfaces API error responses.
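
Regardless of how the SDK surfaces the body, aligning the two shapes is cheap; a minimal sketch, where `apiError` is a hypothetical helper rather than anything in the PR:

```typescript
import { NextResponse } from 'next/server';

// Hypothetical helper giving every error response the same shape:
// a machine-readable `error` code plus a human-readable `message`.
function apiError(status: number, code: string, message: string) {
  return NextResponse.json({ error: code, message }, { status });
}

// Both paths then stay structurally identical:
// return apiError(413, 'context_length_exceeded', contextMessage);
// return apiError(500, 'internal_error', 'Failed to process chat request. Please try again.');
```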

### **5.3. `isRateLimitError` Guard — Good Fix**

```typescript
if (isContextLengthError(errorMessage)) return false;
```

Good: Adding this guard prevents context length errors (which contain "limit" in "token limit") from being misclassified as rate limit errors. This is an important correctness fix.

### **5.4. `getAIErrorMessage` Ordering — Correct**

The context length check is placed before the rate limit check, which is correct because the `isRateLimitError` now early-returns false for context length errors. Even without that guard, the ordering would ensure context length errors are caught first.
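
A sketch of that ordering; the predicates stand in for the module's own exports and the message strings are placeholders, not the PR's copy:

```typescript
// Stubs for the module's own predicates (see 5.1 and 5.3).
declare function isContextLengthError(message: string): boolean;
declare function isRateLimitError(message: string): boolean;

export function getAIErrorMessageSketch(rawMessage: string): string {
  // Context length first: these errors often contain the word "limit",
  // so they must be classified before the rate limit branch runs.
  if (isContextLengthError(rawMessage)) {
    return 'The conversation is too long for this model. Try sending your message again.';
  }
  if (isRateLimitError(rawMessage)) {
    return 'The AI provider is rate limiting requests. Please wait a moment and retry.';
  }
  return 'Failed to process chat request. Please try again.';
}
```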

---

## **6. OWASP Top 10 Security Review** ✅

Reviewing all changes against the current OWASP Top 10 (2021):

| # | Category | Status | Notes |
|---|----------|--------|-------|
| A01 | Broken Access Control | ✅ Safe | No auth changes; truncation operates on already-authenticated data |
| A02 | Cryptographic Failures | ✅ N/A | No cryptographic operations introduced |
| A03 | Injection | ✅ Safe | No user input reaches SQL, shell, or template contexts; error messages don't leak raw user content |
| A04 | Insecure Design | ✅ Safe | Truncation is a defense-in-depth measure |
| A05 | Security Misconfiguration | ✅ Safe | No configuration changes |
| A06 | Vulnerable Components | ✅ Safe | No new dependencies introduced |
| A07 | Auth Failures | ✅ Safe | No auth logic modified |
| A08 | Data Integrity Failures | ⚠️ Minor | The error response leaks `wasTruncated` state and model/provider info indirectly through differentiated error messages. An attacker could infer conversation size. Low risk. |
| A09 | Logging Failures | ✅ Good | Truncation events are properly logged with metrics |
| A10 | SSRF | ✅ N/A | No external URL construction |

**Overall Security Assessment**: No OWASP violations. The minor A08 concern (information disclosure via differentiated error messages) is acceptable for UX purposes.

---

## **7. Test Coverage** ❌

### **No Tests Added**

This is the most significant gap in the PR. The changes introduce:

1. **New behavior in `estimateTokens()`** — CJK-aware tokenization
2. **New function `isContextLengthError()`** — Pattern matching on error strings
3. **Modified function `isRateLimitError()`** — New guard clause
4. **Modified `getAIErrorMessage()`** — New context length branch
5. **30+ new OpenRouter model entries** in `getContextWindowSize()`
6. **Complex truncation orchestration** in the route handler

None of these have test coverage. Per `tdd.mdc`, tests should be written first (TDD). The existing `ai-context-calculator.ts` has zero test files (`Grep` confirms no `*.test.*` files reference it).

**Minimum test coverage needed:**
- `estimateTokens()` — Latin text, CJK text, mixed content, empty string, emoji
- `isContextLengthError()` — True positives for each pattern, false negative for unrelated errors, false positive edge cases (the `413` string issue)
- `isRateLimitError()` — Verify context length errors are excluded
- `getContextWindowSize()` — OpenRouter model detection for each family
- `determineMessagesToInclude()` — Already existed but has no tests

**Severity**: High — this is the primary blocker for production readiness.
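
As a starting point, a sketch of the first cases, assuming Vitest and the signatures implied above (`estimateTokens(text: string): number`, boolean predicates); import paths are assumptions:

```typescript
import { describe, expect, it } from 'vitest';
// Import paths are assumptions based on the files this review covers.
import { estimateTokens } from '@pagespace/lib/monitoring/ai-context-calculator';
import { isContextLengthError, isRateLimitError } from '@/lib/ai/shared/error-messages';

describe('estimateTokens', () => {
  it('uses ~4 chars/token for Latin text', () => {
    expect(estimateTokens('a'.repeat(400))).toBe(100);
  });

  it('uses ~2 chars/token once non-ASCII dominates', () => {
    expect(estimateTokens('你'.repeat(400))).toBe(200);
  });

  it('handles the empty string', () => {
    expect(estimateTokens('')).toBe(0);
  });
});

describe('isContextLengthError', () => {
  it('matches explicit context window errors', () => {
    expect(isContextLengthError('context_length_exceeded')).toBe(true);
  });

  it('does not fire on incidental "413" substrings', () => {
    // Expected to fail against the current substring check; see 5.1.
    expect(isContextLengthError('Processed 413 items')).toBe(false);
  });
});

describe('isRateLimitError', () => {
  it('excludes context length errors that mention "limit"', () => {
    expect(isRateLimitError('maximum context length is 8192 tokens')).toBe(false);
  });
});
```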

---

## **8. Performance Considerations** ✅

- **Token estimation**: O(n) per message, called once per request. Acceptable.
- **Truncation loop**: O(n) reverse iteration over messages. Efficient.
- **Regex in `estimateTokens`**: The `text.match(/[^\x00-\x7F]/g)` creates an array allocation per call. For conversations with many large messages, this could add GC pressure. Consider using a loop or `text.replace(/[\x00-\x7F]/g, '').length` instead.
- **No redundant work**: System prompt tokens and tool tokens are computed once, not per-message.

---

## **9. UI/UX Implementation** ✅

### **SidebarChatTab Centralization — Clean Refactor**

The replacement of 10 lines of inline error detection with `getAIErrorMessage(error.message)` is a clean improvement. The function provides consistent messaging across the app.

**Minor note**: The error message for context length errors says "Older messages have been trimmed to fit — try sending your message again." This is potentially confusing if the error occurred *despite* trimming (the `wasTruncated` path in the route handler sends a different message, but the UI function doesn't differentiate). The route's differentiated 413 messages may not reach this code path (see 5.2).

---

## **10. Architectural Patterns** ⚠️

### **10.1. `originalMessages` Inconsistency**

After truncation, `modelMessages` uses `includedMessages` (truncated), but `createUIMessageStream({ originalMessages: sanitizedMessages })` still uses the full `sanitizedMessages` (untruncated).

If `originalMessages` is used for anything beyond display (e.g., delta computation), this mismatch could cause subtle bugs — the stream thinks the full conversation is present while the AI only sees the truncated subset.

**Severity**: Medium — needs verification of how `originalMessages` is consumed downstream.

### **10.2. Truncation Happens After System Prompt — Correct**

The PR correctly builds the system prompt first, then calculates the budget with system prompt tokens deducted. This is the right ordering since system prompts vary based on page context, drive configuration, and personalization.

### **10.3. Budget Calculation is Sound**

```text
contextWindow → × 0.75 → inputBudget → minus systemPromptTokens → minus toolTokens → message budget
```

This pipeline correctly reserves space for output tokens and accounts for all non-message token consumers. The 25% headroom is reasonable for most models.
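
Worked example with hypothetical numbers: a 200,000-token window yields an input budget of ⌊200,000 × 0.75⌋ = 150,000 tokens; after deducting a 2,000-token system prompt and 3,000 tokens of tool definitions, 145,000 tokens remain for conversation messages.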

---

## **Critical Findings**

### **Strengths**
1. **Proactive defense**: Truncating before API call is better than letting it fail
2. **Good logging**: Truncation events include all relevant metrics for debugging
3. **UI cleanup**: Centralized error messaging removes duplicated logic
4. **CJK awareness**: Thoughtful improvement to token estimation
5. **Correct ordering**: System prompt → budget → truncation pipeline is logically sound
6. **Rate limit guard**: Prevents misclassification of context errors as rate limit errors

### **Issues Requiring Attention**

| # | Severity | Issue | Category |
|---|----------|-------|----------|
| 1 | **High** | No tests for any new or modified functions | Test Coverage |
| 2 | **Medium** | Type casts (`as UIMessage[]`) bypass type safety between incompatible UIMessage types | Type Safety |
| 3 | **Medium** | OpenRouter model sizes duplicate existing provider-specific logic (DRY) | Maintainability |
| 4 | **Medium** | Substring match on `'413'` in `isContextLengthError()` may produce false positives | Correctness |
| 5 | **Medium** | 413 error response may not reach UI correctly via Vercel AI SDK error propagation | Error Handling |
| 6 | **Medium** | `originalMessages` still uses full conversation while model sees truncated subset | Consistency |
| 7 | **Low-Medium** | Gemini Pro via OpenRouter gets 1M instead of correct 2M context window | Accuracy |
| 8 | **Low** | `0.75` magic number should be a named constant | Readability |
| 9 | **Low** | Regex allocation in `estimateTokens()` for large texts | Performance |

---

## **Final Assessment**

### **Overall Score: 68/100** (Good concept, needs hardening)

**Breakdown:**

| Category | Score | Notes |
|----------|-------|-------|
| Requirements Adherence | ✅ 90% | All stated goals implemented |
| Code Quality | ⚠️ 70% | Type casts, magic numbers, DRY violations |
| Test Coverage | ❌ 0% | No tests added for any changes |
| Architecture | ⚠️ 75% | Sound approach, `originalMessages` mismatch |
| Security | ✅ 95% | Clean, no OWASP violations |
| Error Handling | ⚠️ 65% | Good detection logic, uncertain propagation to UI |

### **Production Readiness: ⚠️ CONDITIONAL**

**Blockers before merge:**
1. Add unit tests for `isContextLengthError()`, `estimateTokens()` CJK path, and `getContextWindowSize()` OpenRouter entries
2. Verify that 413 error responses from the route actually surface in `error.message` in the Vercel AI SDK's `useChat` hook

**Recommended improvements (non-blocking):**
3. Extract named constant for 25% headroom
4. Tighten `'413'` string matching to avoid false positives
5. Verify `originalMessages` vs `includedMessages` alignment in `createUIMessageStream`
6. Fix Gemini Pro via OpenRouter context window (1M → 2M)
7. Deduplicate OpenRouter model sizes with existing provider blocks

### **Recommendation: REQUEST CHANGES** — Add tests and verify error propagation

---

## Notes
- Review conducted against review.mdc, javascript.mdc, tdd.mdc, stack.mdc criteria
- OWASP Top 10 (2021) explicitly reviewed — no violations found
- No task plan found in `$projectRoot/tasks/` for this feature
- The approach is architecturally sound — the issues are in implementation details and missing tests