This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.
PDF Problem Cutter - Automatically extracts individual problems and solutions from multi-column test PDF files and saves them as separate files.
Goal: PDF (test paper) → detect layout → extract problems/solutions → individual files (1_prb, 1_sol, ...) → ZIP archive
Development Philosophy: Formal Spec Driven Development
- Always write Idris2 specifications first → then implement Python code
- Type system guarantees data consistency
- Catch specification errors at compile time
# Run all tests
uv run pytest
# Run specific test file
uv run pytest tests/test_layout_detector.py
# Run with coverage
uv run pytest --cov=core
# Run test scripts
uv run python test_samples.py
uv run python test_extract_problems.py
uv run python test_new_extraction.py
cd .specs
# Check individual spec files
idris2 --check System/Base.idr
idris2 --check System/LayoutDetection.idr
idris2 --check System/ProblemExtraction.idr
idris2 --check System/ExtractionWorkflow.idr
idris2 --check System/Workflow.idr
idris2 --check System/OutputFormat.idr
# Detect layout from sample PDF
uv run python examples/detect_layout.py
# Full extraction workflow
uv run python run_full_extraction.py samples/통합과학_1_샘플.pdf
Idris2 dependent types define the data model and provable properties:
- Base.idr: Core types (Coord, BBox, VLine, Region)
- LayoutDetection.idr: Column detection (1/2/3 columns), vertical line detection
- OcrEngine.idr: OCR result types, problem number parsing (1., ①, [정답])
- ProblemExtraction.idr: Problem/solution extraction with validity proofs
- ExtractionWorkflow.idr: 8-step workflow state machine with valid state transitions
- OutputFormat.idr: File naming (1_prb, 1_sol), ZIP packaging
- Workflow.idr: Top-level workflow orchestration
Implements the Idris2 specifications:
- base.py: Basic types (Coord, BBox, VLine)
- layout_detector.py: OpenCV-based column detection (vertical lines, content gaps)
- ocr_engine.py: OCR integration (Tesseract, EasyOCR, Claude Vision)
- problem_extractor.py: Extract problems using boundary strategies
- problem_boundary.py: Problem boundary detection algorithms
- column_linearizer.py: Multi-column → single-column conversion
- result_validator.py: Validate extraction against specifications
- workflow.py: Complete extraction workflow orchestration
- output_generator.py: File generation and ZIP packaging
Standardized tool interface for LLM agents:
- types.py: ToolResult, ToolDiagnostics - standard return format
- pdf.py: summarize_pdf(), load_pdf_images()
- layout.py: detect_page_layout(), summarize_layout()
- ocr.py: OCR result filtering/sorting (stub)
- extraction.py: find_problem_boundaries(), crop_problems(), validate_extraction()
- workflow.py: run_layout_stage(), run_full_workflow_stub()
All AgentTools functions return ToolResult with:
- success: bool
- message: str
- data: dict
- diagnostics: ToolDiagnostics (warnings, errors, info)
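
As a rough illustration of the return format listed above, the two types could look like the following dataclasses. This is a sketch, not the actual contents of types.py: the field names come from the list, while the default values and the `fail` helper are assumptions added for the example.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class ToolDiagnostics:
    """Messages collected while a tool runs (field names from the list above)."""
    warnings: list[str] = field(default_factory=list)
    errors: list[str] = field(default_factory=list)
    info: list[str] = field(default_factory=list)


@dataclass
class ToolResult:
    """Standard return value. The empty defaults are an assumption."""
    success: bool
    message: str
    data: dict[str, Any] = field(default_factory=dict)
    diagnostics: ToolDiagnostics = field(default_factory=ToolDiagnostics)


def fail(message: str, *errors: str) -> ToolResult:
    """Hypothetical helper: build a failed result that carries error details."""
    return ToolResult(False, message, diagnostics=ToolDiagnostics(errors=list(errors)))
```

Keeping every field present even on failure lets callers read `result.diagnostics.errors` without checking for `None` first.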
- Analyze: Read PDF, detect problem count and numbers
- Create Spec: Generate type specification from analysis
- Separate Columns: Convert multi-column → single-column representation
- Extract Problems: Track vertically to extract individual problems
- Validate: Compare extracted vs. expected (from spec)
- Final Verification: Re-check against original PDF
- Generate Files: Create individual problem files with margin trimming
- Archive: Package all files into ZIP
Initial → Analyzed → SpecCreated → ColumnsSeparated → ProblemsExtracted
→ Validated → FinalVerified → FilesGenerated → Archived
↓ (retry if validation fails)
ProblemsExtracted
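
The transition diagram above can be mirrored at runtime with a small lookup table. The sketch below is an assumption about how that might look in Python, not the code in workflow.py: the `State` enum, `ALLOWED` table, and `advance` function are hypothetical names, and the retry edge is modeled as a self-loop on ProblemsExtracted, which is only one reading of the diagram.

```python
from enum import Enum, auto


class State(Enum):
    INITIAL = auto()
    ANALYZED = auto()
    SPEC_CREATED = auto()
    COLUMNS_SEPARATED = auto()
    PROBLEMS_EXTRACTED = auto()
    VALIDATED = auto()
    FINAL_VERIFIED = auto()
    FILES_GENERATED = auto()
    ARCHIVED = auto()


# Each state may move to the next one in definition order.
_ORDER = list(State)
ALLOWED: dict[State, set[State]] = {a: {b} for a, b in zip(_ORDER, _ORDER[1:])}
ALLOWED[State.ARCHIVED] = set()  # terminal state
# Assumed retry edge: re-run extraction when validation fails.
ALLOWED[State.PROBLEMS_EXTRACTED].add(State.PROBLEMS_EXTRACTED)


def advance(current: State, target: State) -> State:
    """Return target if the transition is allowed, otherwise raise ValueError."""
    if target not in ALLOWED[current]:
        raise ValueError(f"invalid transition: {current.name} -> {target.name}")
    return target
```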
- Vertical lines detection (primary): Find strong vertical separators using OpenCV
- Content gaps (fallback): Analyze white space between text blocks
- Problem positions (validation): Use detected problem locations
Always use page midpoint (width // 2) for 2-column layouts when vertical lines are unclear.
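
The midpoint fallback can be sketched as below. The function name, the `max_offset_ratio` parameter, and the rule that "unclear" means zero or several candidate lines near the center are all assumptions for illustration. Only the `width // 2` fallback comes from the rule above.

```python
def two_column_split(page_width: int, vline_xs: list[int],
                     max_offset_ratio: float = 0.15) -> int:
    """Pick the x coordinate that separates the two columns.

    vline_xs holds x positions of detected vertical lines. A line is trusted
    only if it is the single candidate near the page center; otherwise the
    page midpoint is used.
    """
    midpoint = page_width // 2
    max_offset = page_width * max_offset_ratio
    candidates = [x for x in vline_xs if abs(x - midpoint) <= max_offset]
    if len(candidates) == 1:
        return candidates[0]
    return midpoint  # no line, or an ambiguous set of lines
```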
- Arabic: 1., 2., 3.
- Circled: ①, ②, ③
- Bracketed: [1], [2], [3]
- Solution markers: [정답] (answer), [해설] (explanation)
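
A minimal sketch of recognizing these markers at the start of an OCR text line follows. The function name, return shape, and exact regular expressions are assumptions, not the logic in ocr_engine.py. Circled numbers are decoded from the Unicode range U+2460 to U+2473 (① to ⑳).

```python
import re

_ARABIC = re.compile(r"^\s*(\d{1,3})\.(?!\d)")   # "12." but not "3.14"
_BRACKETED = re.compile(r"^\s*\[(\d{1,3})\]")    # "[12]"
_SOLUTION = re.compile(r"^\s*\[(정답|해설)\]")    # "[정답]" or "[해설]"


def parse_marker(text: str):
    """Return ("problem", n), ("solution", marker), or None for a text line."""
    m = _SOLUTION.match(text)
    if m:
        return ("solution", m.group(1))
    m = _ARABIC.match(text) or _BRACKETED.match(text)
    if m:
        return ("problem", int(m.group(1)))
    stripped = text.lstrip()
    if stripped and 0x2460 <= ord(stripped[0]) <= 0x2473:
        return ("problem", ord(stripped[0]) - 0x2460 + 1)
    return None
```

The negative lookahead in `_ARABIC` keeps decimal values such as "3.14" from being read as problem 3.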
- MARKER_ONLY: Use problem number markers only
- MARKER_WITH_WHITESPACE: Markers + whitespace gaps
- COMBINED: Markers + whitespace + heuristics (recommended)
The Idris2 specifications define properties that must hold:
- NoOverlap: Problem regions don't overlap
- ValidLayout: Layout has valid column count and non-overlapping columns
- ProblemsInOrder: Problem numbers are in ascending order
- UniqueFilenames: Output filenames are unique (no collisions)
- CompleteOutput: All problems have corresponding output files
- ValidTransition: Workflow state transitions are valid
Python implementations should maintain these invariants at runtime.
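
Three of these invariants are simple enough to check directly at runtime. The sketch below is illustrative only: the `BBox` field layout (x, y, width, height), the function names, and the reading of "ascending" as strictly ascending are assumptions, and the real types live in base.py and the Idris2 specs.

```python
from itertools import combinations
from typing import NamedTuple


class BBox(NamedTuple):
    # Assumed layout: top-left corner plus size, in pixels.
    x: int
    y: int
    width: int
    height: int


def overlaps(a: BBox, b: BBox) -> bool:
    """True if the two boxes share interior area (touching edges do not count)."""
    return (a.x < b.x + b.width and b.x < a.x + a.width
            and a.y < b.y + b.height and b.y < a.y + a.height)


def check_no_overlap(boxes: list[BBox]) -> bool:
    """NoOverlap: no pair of problem regions intersects."""
    return not any(overlaps(a, b) for a, b in combinations(boxes, 2))


def check_problems_in_order(numbers: list[int]) -> bool:
    """ProblemsInOrder: numbers are strictly ascending."""
    return all(a < b for a, b in zip(numbers, numbers[1:]))


def check_unique_filenames(names: list[str]) -> bool:
    """UniqueFilenames: no two outputs share a name."""
    return len(set(names)) == len(names)
```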
- Read the corresponding .specs/System/*.idr file first
- Understand the data types and proofs required
- Implement Python code maintaining the type invariants
- Add runtime validation matching the proof requirements
- Write tests verifying the invariants
- Check layout detection results first (column boundaries)
- Verify OCR results (problem numbers detected)
- Examine boundary detection strategy
- Validate against spec (expected vs. detected)
- Check for overlapping regions (NoOverlap violation)
from AgentTools import pdf, layout, extraction
from AgentTools.types import ToolResult

# All functions return ToolResult
result: ToolResult = pdf.summarize_pdf("sample.pdf")
if not result.success:
    print(f"Error: {result.message}")
    for error in result.diagnostics.errors:
        print(f" - {error}")
    raise SystemExit(1)  # stop here: a failed result has no data to read

# Access data from successful result
page_count = result.data["page_count"]

- Column Detection: Always detect layout before extraction. Multi-column PDFs must be linearized first.
- Reading Order: Korean test papers typically use left-to-right, top-to-bottom order within columns. Process the left column completely before the right column.
- Margin Trimming: Trim white space around extracted problems but preserve enough context. Default margin: 10-20 pixels.
- OCR Confidence: Filter out low-confidence results (< 0.4). Problem numbers should have high confidence (>= 0.7).
- Validation Retry Logic: If validation fails, retry extraction for the missing problems only (don't re-extract everything).
- File Naming: Use the format {number}_prb.png for problems and {number}_sol.png for solutions. Pad numbers if needed (01_prb vs 1_prb).
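
Two of the notes above, the OCR confidence threshold and the file naming rule, can be sketched in a few lines. Both function names, the `(text, confidence)` tuple shape, and the choice to pad to the width of the largest problem number are assumptions for illustration, not the behavior of output_generator.py.

```python
def filter_by_confidence(results: list[tuple[str, float]],
                         min_conf: float = 0.4) -> list[tuple[str, float]]:
    """Drop OCR results whose confidence is below min_conf."""
    return [(text, conf) for text, conf in results if conf >= min_conf]


def problem_filenames(number: int, total: int, pad: bool = False) -> tuple[str, str]:
    """Return (problem, solution) filenames for one problem number.

    With pad=True the number is zero-padded to the width of `total`,
    so 3 of 20 becomes "03".
    """
    width = len(str(total)) if pad else 1
    stem = f"{number:0{width}d}"
    return f"{stem}_prb.png", f"{stem}_sol.png"
```

Padding keeps files sorted numerically in directory listings and inside the ZIP archive once there are ten or more problems.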
problem_cutter/
├── .specs/System/ # Idris2 specifications (design source of truth)
├── core/ # Python implementation (matches specs)
├── AgentTools/ # LLM-agent friendly wrappers
├── tests/ # pytest test suite
├── samples/ # Test PDF samples
├── output/ # Extraction results (gitignored)
├── directions/ # Workflow documentation
└── run_full_extraction.py # Main entry point
- directions/view.md: 8-step workflow overview
- directions/cutting_pdf.md: Original requirements and design rationale
- OPENCV_GUIDE.md: OpenCV techniques for layout detection
- TEST_RESULTS.md: Test results and known issues
- README.md: Project overview and Idris2 specifications guide