The pipeline requires a SQLite database with:
- `chunks` table: 13,513+ filtered transcript chunks
- `documents` table: episode metadata with video URLs
The original implementation used `lennys_full.db` with 303 podcast episodes.
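A minimal sanity check of this layout can be sketched as below. Only the table names (`chunks`, `documents`) come from the description above; the helper name and any columns are illustrative assumptions.

```python
import sqlite3

def check_schema(db_path: str) -> dict:
    """Hypothetical sketch: count rows in the two tables the pipeline expects.

    Raises sqlite3.OperationalError if either table is missing, which is
    a quick way to fail fast before running the extraction step.
    """
    conn = sqlite3.connect(db_path)
    counts = {}
    for table in ("chunks", "documents"):
        counts[table] = conn.execute(f"SELECT COUNT(*) FROM {table}").fetchone()[0]
    conn.close()
    return counts
```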
- Source: Raw transcript files (e.g., YouTube transcripts)
- Processing: Convert to structured format with timestamps
- Ingest: Load into SQLite database
- Extraction: `modal_extract.py` reads chunks from the database
- Threading: `find_threads_v2.py` discovers connections
- Naming: `name_clusters.py` generates thread names
- Quality: `check_thread_quality.py` validates output
insights_first/
├── README.md # Quick start and pipeline overview
├── PROJECT_LOG.md # This file - complete project history
├── FINAL_SUMMARY.md # Executive summary
├── requirements.txt # Python dependencies
├── .gitignore # Git exclusions (data files, cache)
├── modal_extract.py # Step 1: Extract insights (Modal + vLLM)
├── find_threads_v2.py # Step 2: Discover threads (Louvain)
├── name_clusters.py # Step 3: Name threads (Modal + LLM)
├── check_thread_quality.py # Step 4: Validate quality (Modal + LLM)
├── add_thread_descriptions.py # Step 5: Add thread descriptions
├── enrich_with_video.py # Utility: Add video URLs
├── create_final_export.py # Utility: Create curated output
├── find_debates.py # Experimental: Debate detection
└── data/ # Output directory (gitignored)
├── threads_final.json # Final curated output
├── modal_extraction_*.json # Extracted insights
├── threads_v2_*.json # Raw threading output
└── named_threads_*.json # Named threads
Find "invisible threads" across Lenny's Podcast transcripts — non-obvious insights that connect multiple conversations, inspired by Pieter Levels' syntopic reading project.
The original Breadcrumbs pipeline:
- Chunked transcripts → embedded chunks → K-means clustering → Claude labels clusters
- Result: 28 themes, all WEAK or REJECT quality (avg novelty 2.8/10)
- Root cause: Clustered text similarity, not insight similarity (intros with intros, ads with ads)
Insight-first pipeline:
- Extract insights first (strict filtering) → embed topics → find connections
- Result: 465 high-quality insights (8.1 novelty), 20 threads (116 insights, 25% coverage)
- 125% improvement in novelty scores
File: insights_first/modal_extract.py
- Uses Qwen2.5-7B-Instruct on Modal GPUs
- Strict extraction prompt requiring SPECIFIC + NON-OBVIOUS + ACTIONABLE
- 3.4% extraction rate (465 insights from 13,513 chunks)
- Avg novelty 8.1/10, specificity 8.3/10
File: find_threads_v2.py
- Embeds insights with sentence-transformers
- Builds similarity graph, uses Louvain community detection
- Finds natural clusters of related insights
- Topic-based embedding (not full insight text)
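The graph-building step above can be sketched in plain Python. This is a toy illustration, not the script's actual code: real embeddings come from sentence-transformers (over extracted topics, not full insight text), and Louvain community detection is then run on the resulting graph (e.g. via `networkx`/`python-louvain`); the threshold value here is illustrative.

```python
from itertools import combinations
from math import sqrt

def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = sqrt(sum(x * x for x in a))
    nb = sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def build_similarity_graph(embeddings, threshold=0.5):
    """Connect insight pairs whose topic embeddings exceed the threshold.

    Returns (i, j, similarity) edges; community detection over these edges
    yields the thread candidates.
    """
    edges = []
    for i, j in combinations(range(len(embeddings)), 2):
        sim = cosine(embeddings[i], embeddings[j])
        if sim >= threshold:
            edges.append((i, j, sim))
    return edges
```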
File: name_clusters.py
- Anti-fluff prompt requiring concrete, specific names
- Identifies threads without clear themes as NO_CLEAR_THREAD
- Note: LLM naming often produces ALL_CAPS - manual curation recommended
File: check_thread_quality.py
- Evaluates: Theme Specificity, Insight Alignment, Novelty, Actionability
- Verdicts: STRONG / MODERATE / WEAK / REJECT
File: find_debates.py
- Step 1: LLM extracts TOPIC + STANCE from each insight
- Step 2: Group insights by topic similarity (embeddings on topics, not insights)
- Step 3: LLM checks for genuine opposition within topic groups
- Status: High false positive rate (95%+) - LLM treats emphasis differences as opposition
- Not included in final output
File: enrich_with_video.py (for existing data)
Built into: modal_extract.py (for new extractions)
- Joins `video_url` from the documents table
- Joins `timestamp_start` from the chunks table
- Generates `timestamp_url` like `https://youtube.com/watch?v=xyz&t=1633`
- Every insight is now directly linkable to the exact moment in the YouTube video
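The URL construction is simple enough to sketch directly. The function name is an assumption; it follows the join described above (`video_url` from documents, `timestamp_start` from chunks) and assumes the video URL already carries a query string, as YouTube watch URLs do.

```python
def timestamp_url(video_url: str, timestamp_start: float) -> str:
    """Append YouTube's start-time parameter so the link opens at the
    insight's exact moment. timestamp_start is seconds into the video."""
    return f"{video_url}&t={int(timestamp_start)}"
```

For example, `timestamp_url("https://youtube.com/watch?v=xyz", 1633.7)` yields `https://youtube.com/watch?v=xyz&t=1633`.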
| Decision | Reasoning | Outcome |
|---|---|---|
| Extract insights BEFORE clustering | Original pipeline clustered noise together | 125% novelty improvement |
| Use Modal + vLLM for batch processing | Local Ollama too slow (8s/chunk), Claude API expensive | 13K chunks in ~17 min |
| Use Qwen2.5-7B (not Qwen3) | Qwen3's thinking mode caused empty responses via API | Works reliably |
| Graph-based threading (Louvain) | K-means forced artificial clusters | Natural groupings emerge |
| min_size=2 for threads | 2-insight threads still valuable if from different guests | 20 threads (8 major + 12 emerging) |
| Same-guest filtering | 2-insight threads from same guest are likely duplicates | Removed 3 low-quality threads |
| Strict naming prompt (anti-fluff) | Avoid grandiose names like "Strategic Excellence" | More concrete names |
| Filter sponsor content | Polluted original clustering | Cleaner insights |
| min_episodes=3 filtering | Single-source threads aren't "invisible" | Only multi-episode threads pass |
| Episode context in naming prompt | LLM was naming threads without knowing they span conversations | Better aligned names |
| Decision | Why Rejected |
|---|---|
| Qwen3-0.6B model | No discrimination (100% extraction rate, all novelty=5) |
| Qwen3-4B/8B models | Thinking mode produces empty responses, 40+ sec/chunk |
| HDBSCAN clustering | Too aggressive - put 348/465 insights in one cluster |
| K-means forced clustering | Artificial groupings, not natural connections |
| Connected components | Chain effects merged unrelated insights |
| High similarity threshold (0.65) | Only 35 edges, missed connections |
| Embedding full insights | Captures vocabulary not concepts (99th %ile similarity: 0.496) |
| Debate detection | LLM treats nuance differences as opposition, 95%+ false positives |
| Automatic thread naming | Generated ALL_CAPS_WITH_UNDERSCORES, many NO_CLEAR_THREAD |
- Insight-first approach: Extract quality first, cluster second
- Modal for parallel processing: 10x faster than local, affordable
- Strict extraction prompts: 3.4% rate but high quality (8.1 avg novelty)
- Louvain community detection: Finds dense subgroups
- Topic extraction + embedding: Captures conceptual similarity not vocabulary
- Multi-episode filtering: Rejecting single-source "threads" (min 2 different episodes)
- min_size=2 with same-guest filtering: Improved coverage 20% → 25% while maintaining quality
- Deduplication: Only 1 insight per episode per thread
- Video linking: Every insight has direct YouTube timestamp URL
- Manual curation: LLM naming failed, human curation produced clear names
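The per-episode deduplication rule above (one insight per episode per thread) can be sketched as follows. The field names (`episode`, `novelty`) and the tie-breaking choice of keeping the highest-novelty insight are assumptions for illustration.

```python
def dedupe_thread(insights):
    """Keep at most one insight per episode within a thread.

    When an episode contributes several insights, keep the one with the
    highest novelty score (an assumed tie-breaker).
    """
    best = {}
    for ins in insights:
        ep = ins["episode"]
        if ep not in best or ins["novelty"] > best[ep]["novelty"]:
            best[ep] = ins
    return list(best.values())
```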
- Small local models (0.6B-4B): Can't discriminate quality
- Qwen3's thinking mode: Incompatible with API usage
- Embedding full insights: Captures vocabulary, not concepts
  - 99th-percentile similarity was only 0.496
  - Different words for the same concept → low similarity
  - High similarity = near-duplicates, not conceptual matches
- Debate detection: Fundamentally flawed
  - LLM treats emphasis differences as oppositions
  - "Prioritize X" vs "Don't solely rely on X" → marked as opposition, but both recommend using X
  - 9.5% acceptance rate but 95%+ false positives
  - Cannot distinguish genuine opposition from different emphasis or complementary views
- Automatic thread naming: LLM produced ALL_CAPS_WITH_UNDERSCORES or NO_CLEAR_THREAD
- Single-source threads: Clusters from one episode aren't "invisible threads"
- 116/465 insights (25%) are in threads
- 349 insights are "unique" - no strong connections found
- Adding min_size=2 improved coverage from 20% → 25%
- Same-guest filtering removed 3 duplicate threads
- 349 high-quality insights not in threads could be valuable standalone content
- Each is still a non-obvious, actionable insight from the podcast
threads_final.json (RECOMMENDED FOR FRONTEND)
- 20 high-quality invisible threads
- 8 major threads (3+ insights each)
- 12 emerging threads (2 insights each)
- 116 insights across 465 total (25% coverage)
- All threads span 2+ different episodes
- Same-guest duplicates filtered out (3 threads removed)
- Thread names manually curated for clarity
- All insights deduplicated (1 per episode max)
- All insights have novelty 8-9/10
modal_extraction_20260120_024600.json
- 465 extracted insights from 13,513 chunks
- Avg novelty: 8.1/10, Avg specificity: 8.3/10
- Enriched with video URLs and timestamps
| Metric | Original Pipeline | V1 (Insight Embedding) | V2 (Topic Embedding) | Final (Curated) |
|---|---|---|---|---|
| Approach | K-means on chunks | Embed full insights | Embed extracted topics | Topic + manual curation |
| Threads | 28 | 7 | 18 (before filter) | 8 |
| Insights in threads | N/A | 37 (8%) | 253 (54%) | 92 (20%) |
| Quality | WEAK/REJECT | Manually validated | Many NO_CLEAR_THREAD | High quality |
| Avg Novelty | 2.8/10 | ~7/10 | ~8/10 | 8.1/10 |
| Deduplication | N/A | None | None | Yes (1/episode) |
| Thread Names | Generic | Good | Poor (ALL_CAPS) | Manually curated |
# Step 1: Extract insights from chunks (includes video URLs by default)
modal run modal_extract.py --db your_database.db
# Step 2: Find threads (records episode count per cluster)
python find_threads_v2.py --input data/modal_extraction_TIMESTAMP.json --min-episodes 2
# Step 3: Name threads (pre-filters by min_episodes, shows episode context to LLM)
modal run name_clusters.py --input data/threads_TIMESTAMP.json --min-episodes 2
# Step 4: Check quality (pre-filters by min_episodes)
modal run check_thread_quality.py --input data/named_threads_TIMESTAMP.json --min-episodes 2
# Step 5: Add descriptions to 2-insight threads
modal run add_thread_descriptions.py
# Step 6: Create final curated export
python create_final_export.py
# (Optional) Enrich existing data with video URLs
python enrich_with_video.py --db your_database.db

Note: The --min-episodes flag (default 2) ensures only threads spanning multiple episodes qualify as "invisible threads". Single-source clusters are filtered out.
The following were removed during cleanup (2026-01-20):
- `cluster_insights.py` - Old HDBSCAN/K-means approach (replaced by `find_threads_v2.py`)
- `find_contradictions.py` - v1 embedding similarity approach (replaced by `find_debates.py`)
- `backend/` - Old Ollama-based extraction (replaced by Modal)
- `scripts/` - Old local extraction scripts
- Intermediate data files (old threads, clusters, validation)
Removed during pipeline fixes (2026-01-23):
- `validate_threads.py` - Temporary script for validating threads (now built into the pipeline)
Active Scripts:
- `modal_extract.py` / `modal_extract_pg.py` - Extract insights (Lenny's / Paul Graham)
- `find_threads_v2.py` - Thread discovery (current version)
- `find_threads.py` - Legacy threading (kept for reference)
- `name_clusters.py` - Thread naming
- `check_thread_quality.py` - Quality validation
- `add_thread_descriptions.py` - Add descriptions to threads
- `create_final_export.py` - Create curated output
- `create_clean_threads_v2.py` - Thread cleanup utilities
- `enrich_with_video.py` - Add video URLs
- `list_threads.py` - List threads in files
- `merge_pairs.py` - Merge thread pairs
- `fix_pg_threads.py` - Fix Paul Graham threads
Experimental:
- `find_debates.py` / `validate_debates.py` - Debate detection (high false positive rate)
Last updated: 2026-01-24 (Updated documentation for GitHub)