
Insight-First Pipeline: Project Log

Quick Reference

Database Requirements

The pipeline requires a SQLite database with:

  • chunks table: 13,513+ filtered transcript chunks
  • documents table: episode metadata with video URLs

Original implementation used lennys_full.db with 303 podcast episodes.
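The minimal schema the pipeline depends on can be sketched as below. Column names (`document_id`, `text`, `timestamp_start`, `video_url`) are assumptions based on how the steps are described, not a dump of `lennys_full.db`:

```python
import sqlite3

# Minimal schema sketch; the actual columns in lennys_full.db may differ.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE documents (
    id INTEGER PRIMARY KEY,
    title TEXT,
    video_url TEXT            -- used later for timestamp links
);
CREATE TABLE chunks (
    id INTEGER PRIMARY KEY,
    document_id INTEGER REFERENCES documents(id),
    text TEXT,
    timestamp_start INTEGER   -- seconds into the episode
);
""")

# The join the pipeline relies on: chunk text plus its episode's video URL
conn.execute("INSERT INTO documents VALUES (1, 'Episode 1', 'https://youtube.com/watch?v=xyz')")
conn.execute("INSERT INTO chunks VALUES (1, 1, 'some transcript text', 1633)")
row = conn.execute("""
    SELECT c.text, c.timestamp_start, d.video_url
    FROM chunks c JOIN documents d ON c.document_id = d.id
""").fetchone()
```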

Data Pipeline Overview

  1. Source: Raw transcript files (e.g., YouTube transcripts)
  2. Processing: Convert to structured format with timestamps
  3. Ingest: Load into SQLite database
  4. Extraction: modal_extract.py reads chunks from database
  5. Threading: find_threads_v2.py discovers connections
  6. Naming: name_clusters.py generates thread names
  7. Quality: check_thread_quality.py validates output

Repository Structure

insights_first/
├── README.md                   # Quick start and pipeline overview
├── PROJECT_LOG.md              # This file - complete project history
├── FINAL_SUMMARY.md            # Executive summary
├── requirements.txt            # Python dependencies
├── .gitignore                  # Git exclusions (data files, cache)
├── modal_extract.py            # Step 1: Extract insights (Modal + vLLM)
├── find_threads_v2.py          # Step 2: Discover threads (Louvain)
├── name_clusters.py            # Step 3: Name threads (Modal + LLM)
├── check_thread_quality.py     # Step 4: Validate quality (Modal + LLM)
├── add_thread_descriptions.py  # Step 5: Add thread descriptions
├── enrich_with_video.py        # Utility: Add video URLs
├── create_final_export.py      # Utility: Create curated output
├── find_debates.py             # Experimental: Debate detection
└── data/                       # Output directory (gitignored)
    ├── threads_final.json              # Final curated output
    ├── modal_extraction_*.json         # Extracted insights
    ├── threads_v2_*.json               # Raw threading output
    └── named_threads_*.json            # Named threads

Project Evolution

Original Goal

Find "invisible threads" across Lenny's Podcast transcripts — non-obvious insights that connect multiple conversations, inspired by Pieter Levels' syntopic reading project.

Where We Started

The original Breadcrumbs pipeline:

  • Chunked transcripts → embedded chunks → K-means clustering → Claude labels clusters
  • Result: 28 themes, all WEAK or REJECT quality (avg novelty 2.8/10)
  • Root cause: Clustered text similarity, not insight similarity (intros with intros, ads with ads)

Where We Are Now

Insight-first pipeline:

  • Extract insights first (strict filtering) → embed topics → find connections
  • Result: 465 high-quality insights (8.1 novelty), 20 threads (116 insights, 25% coverage)
  • 125% improvement in novelty scores

What We Built

1. Insight Extraction (Modal + vLLM)

File: insights_first/modal_extract.py

  • Uses Qwen2.5-7B-Instruct on Modal GPUs
  • Strict extraction prompt requiring SPECIFIC + NON-OBVIOUS + ACTIONABLE
  • 3.4% extraction rate (465 insights from 13,513 chunks)
  • Avg novelty 8.1/10, specificity 8.3/10
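A strict filter on the LLM's scores is what drives the low (3.4%) extraction rate. The sketch below is illustrative: the field names and thresholds are assumptions, not the exact logic in modal_extract.py:

```python
# Hypothetical post-extraction filter; field names and thresholds are
# illustrative, not the exact values used in modal_extract.py.
def keep_insight(insight: dict, min_novelty: int = 7, min_specificity: int = 7) -> bool:
    """Keep only insights the LLM scored as both novel and specific."""
    return (
        insight.get("novelty", 0) >= min_novelty
        and insight.get("specificity", 0) >= min_specificity
    )

raw = [
    {"topic": "pricing experiments", "novelty": 8, "specificity": 9},
    {"topic": "generic advice", "novelty": 4, "specificity": 5},
]
kept = [i for i in raw if keep_insight(i)]
```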

2. Thread Discovery (Graph-based)

File: find_threads_v2.py

  • Embeds insights with sentence-transformers
  • Builds similarity graph, uses Louvain community detection
  • Finds natural clusters of related insights
  • Topic-based embedding (not full insight text)
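The graph-building and Louvain steps can be sketched as follows. The toy vectors stand in for sentence-transformers topic embeddings, and the 0.5 similarity threshold is illustrative, not the value find_threads_v2.py uses:

```python
import numpy as np
import networkx as nx

# Toy topic embeddings standing in for sentence-transformers output.
emb = np.array([
    [1.0, 0.0], [0.9, 0.1],   # two insights on one topic
    [0.0, 1.0], [0.1, 0.9],   # two insights on another
])
emb = emb / np.linalg.norm(emb, axis=1, keepdims=True)
sim = emb @ emb.T  # cosine similarity matrix

# Connect insights whose topics are similar enough (threshold is illustrative)
G = nx.Graph()
G.add_nodes_from(range(len(emb)))
for i in range(len(emb)):
    for j in range(i + 1, len(emb)):
        if sim[i, j] >= 0.5:
            G.add_edge(i, j, weight=float(sim[i, j]))

# Louvain community detection finds dense subgroups: the candidate threads
threads = nx.community.louvain_communities(G, weight="weight", seed=42)
```

Unlike K-means, nothing forces every insight into a cluster: nodes with no strong edges simply end up in singleton communities.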

3. Thread Naming (Modal + LLM)

File: name_clusters.py

  • Anti-fluff prompt requiring concrete, specific names
  • Identifies threads without clear themes as NO_CLEAR_THREAD
  • Note: LLM naming often produces ALL_CAPS - manual curation recommended

4. Quality Checker (Modal + LLM)

File: check_thread_quality.py

  • Evaluates: Theme Specificity, Insight Alignment, Novelty, Actionability
  • Verdicts: STRONG / MODERATE / WEAK / REJECT

5. Debate Finder (Topic-Stance Approach) [EXPERIMENTAL]

File: find_debates.py

  • Step 1: LLM extracts TOPIC + STANCE from each insight
  • Step 2: Group insights by topic similarity (embeddings on topics, not insights)
  • Step 3: LLM checks for genuine opposition within topic groups
  • Status: High false positive rate (95%+) - LLM treats emphasis differences as opposition
  • Not included in final output

6. Video Linking

File: enrich_with_video.py (for existing data); also built into modal_extract.py (for new extractions)

  • Joins video_url from documents table
  • Joins timestamp_start from chunks table
  • Generates timestamp_url like https://youtube.com/watch?v=xyz&t=1633
  • Every insight is now directly linkable to the exact moment in the YouTube video
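The URL construction itself is simple. This is a simplified sketch of the join's final step, assuming the `t=` query-parameter convention shown above:

```python
# Sketch of the timestamp link construction; query-string handling is
# simplified compared to a full URL parser.
def timestamp_url(video_url: str, timestamp_start: float) -> str:
    """Append a seconds offset so the link opens at the exact moment."""
    sep = "&" if "?" in video_url else "?"
    return f"{video_url}{sep}t={int(timestamp_start)}"

url = timestamp_url("https://youtube.com/watch?v=xyz", 1633)
```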

Key Decisions Made

Accepted ✓

| Decision | Reasoning | Outcome |
| --- | --- | --- |
| Extract insights BEFORE clustering | Original pipeline clustered noise together | 125% novelty improvement |
| Use Modal + vLLM for batch processing | Local Ollama too slow (8s/chunk), Claude API expensive | 13K chunks in ~17 min |
| Use Qwen2.5-7B (not Qwen3) | Qwen3's thinking mode caused empty responses via API | Works reliably |
| Graph-based threading (Louvain) | K-means forced artificial clusters | Natural groupings emerge |
| min_size=2 for threads | 2-insight threads still valuable if from different guests | 20 threads (8 major + 12 emerging) |
| Same-guest filtering | 2-insight threads from same guest are likely duplicates | Removed 3 low-quality threads |
| Strict naming prompt (anti-fluff) | Avoid grandiose names like "Strategic Excellence" | More concrete names |
| Filter sponsor content | Polluted original clustering | Cleaner insights |
| min_episodes=3 filtering | Single-source threads aren't "invisible" | Only multi-episode threads pass |
| Episode context in naming prompt | LLM was naming threads without knowing they span conversations | Better aligned names |

Rejected ✗

| Decision | Why Rejected |
| --- | --- |
| Qwen3-0.6B model | No discrimination (100% extraction rate, all novelty=5) |
| Qwen3-4B/8B models | Thinking mode produces empty responses, 40+ sec/chunk |
| HDBSCAN clustering | Too aggressive: put 348/465 insights in one cluster |
| K-means forced clustering | Artificial groupings, not natural connections |
| Connected components | Chain effects merged unrelated insights |
| High similarity threshold (0.65) | Only 35 edges, missed connections |
| Embedding full insights | Captures vocabulary, not concepts (99th %ile similarity: 0.496) |
| Debate detection | LLM treats nuance differences as opposition, 95%+ false positives |
| Automatic thread naming | Generated ALL_CAPS_WITH_UNDERSCORES, many NO_CLEAR_THREAD |

What Worked

  1. Insight-first approach: Extract quality first, cluster second
  2. Modal for parallel processing: 10x faster than local, affordable
  3. Strict extraction prompts: 3.4% rate but high quality (8.1 avg novelty)
  4. Louvain community detection: Finds dense subgroups
  5. Topic extraction + embedding: Captures conceptual similarity, not vocabulary
  6. Multi-episode filtering: Rejecting single-source "threads" (min 2 different episodes)
  7. min_size=2 with same-guest filtering: Improved coverage 20% → 25% while maintaining quality
  8. Deduplication: Only 1 insight per episode per thread
  9. Video linking: Every insight has direct YouTube timestamp URL
  10. Manual curation: LLM naming failed, human curation produced clear names
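The deduplication rule (item 8) can be sketched as below. The field names and the highest-novelty-wins tiebreak are illustrative assumptions:

```python
# Sketch of the one-insight-per-episode-per-thread rule; field names and the
# novelty tiebreak are illustrative assumptions.
def dedupe_thread(insights: list[dict]) -> list[dict]:
    """Keep at most one insight per episode, preferring higher novelty."""
    seen, kept = set(), []
    for ins in sorted(insights, key=lambda i: -i.get("novelty", 0)):
        if ins["episode_id"] not in seen:
            seen.add(ins["episode_id"])
            kept.append(ins)
    return kept

thread = [
    {"episode_id": "e1", "novelty": 9, "text": "a"},
    {"episode_id": "e1", "novelty": 8, "text": "b"},  # duplicate episode, dropped
    {"episode_id": "e2", "novelty": 8, "text": "c"},
]
deduped = dedupe_thread(thread)
```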

What Didn't Work

  1. Small local models (0.6B-4B): Can't discriminate quality
  2. Qwen3's thinking mode: Incompatible with API usage
  3. Embedding full insights: Captures vocabulary, not concepts
    • 99th percentile similarity was only 0.496
    • Different words for same concept → low similarity
    • High similarity = near-duplicates, not conceptual matches
  4. Debate detection: Fundamentally flawed
    • LLM treats emphasis differences as oppositions
    • "Prioritize X" vs "Don't solely rely on X" → marked as opposition but both recommend using X
    • 9.5% acceptance rate but 95%+ false positives
    • Cannot distinguish: genuine opposition vs different emphasis vs complementary views
  5. Automatic thread naming: LLM produced ALL_CAPS_WITH_UNDERSCORES or NO_CLEAR_THREAD
  6. Single-source threads: Clusters from one episode aren't "invisible threads"

Current Gaps & Future Ideas

Thread Coverage

  • 116/465 insights (25%) are in threads
  • 349 insights are "unique" - no strong connections found
  • Adding min_size=2 improved coverage from 20% → 25%
  • Same-guest filtering removed 3 duplicate threads

Unconnected Insights

  • 349 high-quality insights not in threads could be valuable standalone content
  • Each is still a non-obvious, actionable insight from the podcast

Final Output Files

threads_final.json (RECOMMENDED FOR FRONTEND)

  • 20 high-quality invisible threads
    • 8 major threads (3+ insights each)
    • 12 emerging threads (2 insights each)
  • 116 insights across 465 total (25% coverage)
  • All threads span 2+ different episodes
  • Same-guest duplicates filtered out (3 threads removed)
  • Thread names manually curated for clarity
  • All insights deduplicated (1 per episode max)
  • All insights have novelty 8-9/10

modal_extraction_20260120_024600.json

  • 465 extracted insights from 13,513 chunks
  • Avg novelty: 8.1/10, Avg specificity: 8.3/10
  • Enriched with video URLs and timestamps

Final Metrics

| Metric | Original Pipeline | V1 (Insight Embedding) | V2 (Topic Embedding) | Final (Curated) |
| --- | --- | --- | --- | --- |
| Approach | K-means on chunks | Embed full insights | Embed extracted topics | Topic + manual curation |
| Threads | 28 | 7 | 18 (before filter) | 8 |
| Insights in threads | N/A | 37 (8%) | 253 (54%) | 92 (20%) |
| Quality | WEAK/REJECT | Manually validated | Many NO_CLEAR_THREAD | High quality |
| Avg Novelty | 2.8/10 | ~7/10 | ~8/10 | 8.1/10 |
| Deduplication | N/A | None | None | Yes (1/episode) |
| Thread Names | Generic | Good | Poor (ALL_CAPS) | Manually curated |

How to Run the Pipeline

```bash
# Step 1: Extract insights from chunks (includes video URLs by default)
modal run modal_extract.py --db your_database.db

# Step 2: Find threads (records episode count per cluster)
python find_threads_v2.py --input data/modal_extraction_TIMESTAMP.json --min-episodes 2

# Step 3: Name threads (pre-filters by min_episodes, shows episode context to LLM)
modal run name_clusters.py --input data/threads_TIMESTAMP.json --min-episodes 2

# Step 4: Check quality (pre-filters by min_episodes)
modal run check_thread_quality.py --input data/named_threads_TIMESTAMP.json --min-episodes 2

# Step 5: Add descriptions to 2-insight threads
modal run add_thread_descriptions.py

# Step 6: Create final curated export
python create_final_export.py

# (Optional) Enrich existing data with video URLs
python enrich_with_video.py --db your_database.db
```

Note: The --min-episodes flag (default 2) ensures only threads spanning multiple episodes qualify as "invisible threads". Single-source clusters are filtered out.
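The multi-episode check behind that flag amounts to counting distinct episodes per thread. A minimal sketch, assuming an illustrative thread structure with `episode_id` on each insight:

```python
# Sketch of the --min-episodes filter; the thread structure is illustrative.
def is_invisible_thread(thread: dict, min_episodes: int = 2) -> bool:
    """A thread only qualifies if its insights span multiple episodes."""
    episodes = {ins["episode_id"] for ins in thread["insights"]}
    return len(episodes) >= min_episodes

single = {"insights": [{"episode_id": "e1"}, {"episode_id": "e1"}]}
multi = {"insights": [{"episode_id": "e1"}, {"episode_id": "e2"}]}
```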


Deprecated/Removed Files

The following were removed during cleanup (2026-01-20):

  • cluster_insights.py - Old HDBSCAN/K-means approach (replaced by find_threads_v2.py)
  • find_contradictions.py - v1 embedding similarity approach (replaced by find_debates.py)
  • backend/ - Old Ollama-based extraction (replaced by Modal)
  • scripts/ - Old local extraction scripts
  • Intermediate data files (old threads, clusters, validation)

Removed during pipeline fixes (2026-01-23):

  • validate_threads.py - Temporary script for validating threads (now built into pipeline)

Current Repository Files

Active Scripts:

  • modal_extract.py / modal_extract_pg.py - Extract insights (Lenny's / Paul Graham)
  • find_threads_v2.py - Thread discovery (current version)
  • find_threads.py - Legacy threading (kept for reference)
  • name_clusters.py - Thread naming
  • check_thread_quality.py - Quality validation
  • add_thread_descriptions.py - Add descriptions to threads
  • create_final_export.py - Create curated output
  • create_clean_threads_v2.py - Thread cleanup utilities
  • enrich_with_video.py - Add video URLs
  • list_threads.py - List threads in files
  • merge_pairs.py - Merge thread pairs
  • fix_pg_threads.py - Fix Paul Graham threads

Experimental:

  • find_debates.py / validate_debates.py - Debate detection (high false positive rate)

Last updated: 2026-01-24 (Updated documentation for GitHub)