Skip to content

Latest commit

Β 

History

History
516 lines (405 loc) Β· 17.2 KB

File metadata and controls

516 lines (405 loc) Β· 17.2 KB

Playwright MCP Guide for Web Research

Version: 2.0 - Unified Documentation System Last Updated: 2025-09-06 Purpose: Optimize Playwright usage for web research and documentation extraction


🎯 Playwright Overview

Playwright MCP server provides web automation capabilities for research, documentation extraction, and web content analysis. It's particularly valuable for extracting comprehensive documentation, research papers, and technical content from web sources.

Current Usage Statistics

  • Current Usage: 21 tool uses (moderately utilized)
  • Usage Pattern: Documentation extraction and web research
  • Primary Value: Comprehensive content extraction and analysis

πŸ” When to Use Playwright

Primary Research Scenarios

Documentation Extraction

  • βœ… Complete Documentation: Extract full documentation from web pages
  • βœ… Research Papers: Download and analyze academic papers
  • βœ… Technical Guides: Extract comprehensive implementation guides
  • βœ… API Documentation: Get complete API references and examples

Content Analysis

  • βœ… Web Scraping: Extract structured data from websites
  • βœ… Research Synthesis: Gather information from multiple sources
  • βœ… Competitive Analysis: Analyze competitor implementations
  • βœ… Trend Research: Extract current technology trends and patterns

Development Workflow Integration

  • βœ… Pre-Implementation: Gather comprehensive requirements and examples
  • βœ… Technology Evaluation: Extract detailed comparisons and benchmarks
  • βœ… Best Practices: Collect real-world implementation patterns
  • βœ… Problem Research: Extract solutions from forums and communities

πŸ› οΈ Playwright Usage Patterns

Initialization Checklist (Required)

  • Run install once per environment boot (ensures browser availability)
  • Validate configuration (credentials/paths if needed) before first navigation

Robust Extraction Flow (Recommended)

  1. Install (once): Ensure the browser is installed/bootstrapped
  2. Navigate: Go to target URL
  3. Wait: Wait for known on-page text (e.g., a heading string) to appear
  4. Evaluate or Extract:
    • Prefer evaluating after the wait, targeting specific selectors or headings
    • If evaluate returns nothing, fall back to extracting page text content (not raw HTML) using the extraction step

Fallback Strategy

  • If DOM evaluation yields empty content, run a text-extract to capture the rendered page’s main content (headings + body) rather than HTML nodes.

Do Not Mix Engines in One Flow

  • Treat Playwright and Puppeteer as independent engines. Do not assume shared state.

Documentation Extraction Workflow

Step 1: Navigate to Target Content

// Navigate to documentation page
browser_navigate({
  url: "https://developer.android.com/guide/topics/media/mediarecorder"
})

Step 2: Extract Complete Content

// Extract comprehensive documentation
browser_extract({
  instruction: "Extract the complete MediaRecorder API documentation including all methods, parameters, examples, and best practices"
})

Step 3: Save for Offline Analysis

// Save extracted content for analysis
browser_snapshot({
  name: "android_mediarecorder_docs"
})

Research Paper Extraction

Academic Content Extraction

// Navigate to research paper
browser_navigate({
  url: "https://arxiv.org/abs/2401.12345"
})

// Extract complete paper content
browser_extract({
  instruction: "Extract the complete research paper including abstract, methodology, results, and conclusions"
})

// Save PDF for detailed analysis
browser_pdf_save({
  name: "audio_processing_research_paper"
})

Multi-Source Research

Comprehensive Research Gathering

// Research complex topics from multiple sources
const researchSources = [
  "https://developer.android.com/topic/architecture",
  "https://medium.com/androiddevelopers/modern-android-development",
  "https://proandroiddev.com/android-architecture-patterns"
];

for (const url of researchSources) {
  await browser_navigate({ url });
  const content = await browser_extract({
    instruction: "Extract key architecture patterns, benefits, and implementation considerations"
  });
  await browser_snapshot({ name: `architecture_research_${url.split('/').pop()}` });
}

πŸ“Š Playwright Effectiveness Optimization

Usage Tracking and Metrics

Current Usage Analysis

  • Primary Usage: Documentation extraction and web content analysis
  • Content Quality: High for comprehensive documentation extraction
  • Research Efficiency: Excellent for gathering information from multiple sources
  • Integration Rate: Well-integrated into research and documentation workflows

Optimization Targets

  • Content Completeness: Achieve 95%+ complete extraction of target content
  • Research Efficiency: 70%+ time savings vs. manual research
  • Source Coverage: Extract from 10+ sources per complex research topic
  • Content Quality: Maintain high accuracy and relevance of extracted content

Content Quality Assessment

Extraction Completeness Framework

  • High Completeness (8-10): 90%+ of target content extracted
  • Medium Completeness (5-7): 70-89% of target content extracted
  • Low Completeness (1-4): Less than 70% of target content extracted
  • Incomplete (0): Major content sections missing

Content Accuracy Assessment

  • High Accuracy: All extracted information correct and current
  • Medium Accuracy: Mostly correct with minor errors
  • Low Accuracy: Significant errors or outdated information
  • Unreliable: Major inaccuracies requiring manual verification

πŸ”¬ Research Integration Patterns

Pre-Implementation Research Process

Step 1: Identify Research Needs

// Define research requirements
const researchRequirements = {
  topic: "Android real-time audio processing",
  sources: ["official docs", "research papers", "implementation guides"],
  depth: "comprehensive",
  format: "technical implementation"
}

Step 2: Source Identification

// Identify high-quality sources
const researchSources = [
  {
    url: "https://developer.android.com/guide/topics/media/mediarecorder",
    type: "official documentation",
    priority: "high"
  },
  {
    url: "https://arxiv.org/search/?query=audio+processing+mobile",
    type: "academic research",
    priority: "high"
  },
  {
    url: "https://proandroiddev.com/tag/audio",
    type: "community implementation",
    priority: "medium"
  }
];

Step 3: Comprehensive Extraction

// Extract from each source
for (const source of researchSources) {
  await browser_navigate({ url: source.url });
  
  const content = await browser_extract({
    instruction: `Extract comprehensive ${source.type} content about ${researchRequirements.topic}, including code examples, best practices, and implementation details`
  });
  
  await browser_snapshot({
    name: `${researchRequirements.topic}_${source.type}_${Date.now()}`
  });
}

Step 4: Content Synthesis

// Synthesize findings across sources
const synthesis = {
  official_patterns: extractOfficialPatterns(content),
  research_findings: extractResearchFindings(content),
  implementation_examples: extractCodeExamples(content),
  best_practices: extractBestPractices(content)
};

// Generate research summary
generateResearchSummary(synthesis);

Framework-Specific Research

Kotlin Migration Research

// Research Kotlin implementation patterns
browser_navigate({
  url: "https://developer.android.com/kotlin"
});

const kotlinPatterns = await browser_extract({
  instruction: "Extract Kotlin implementation patterns, migration strategies, and best practices for Android development"
});

await browser_snapshot({ name: "kotlin_migration_patterns" });

ML Strategy Research

// Research ML implementation approaches
browser_navigate({
  url: "https://developers.google.com/ml-kit"
});

const mlPatterns = await browser_extract({
  instruction: "Extract ML implementation patterns, model optimization techniques, and performance considerations for mobile"
});

await browser_snapshot({ name: "ml_implementation_patterns" });

Performance Framework Research

// Research performance optimization techniques
browser_navigate({
  url: "https://developer.android.com/topic/performance"
});

const performanceTechniques = await browser_extract({
  instruction: "Extract performance optimization techniques, benchmarking methods, and best practices for Android apps"
});

await browser_snapshot({ name: "android_performance_optimization" });

πŸš€ Advanced Playwright Techniques

Multi-Page Research Strategies

Comprehensive Documentation Collection

// Research complex topics across multiple pages
const comprehensiveResearch = async (topic) => {
  // Find related pages
  const searchResults = await browser_navigate({
    url: `https://developer.android.com/search?q=${encodeURIComponent(topic)}`
  });
  
  const relatedPages = await browser_extract({
    instruction: "Extract links to all relevant documentation pages for this topic"
  });
  
  // Extract from each related page
  const allContent = [];
  for (const page of relatedPages.slice(0, 5)) {  // Limit to top 5
    await browser_navigate({ url: page.url });
    const content = await browser_extract({
      instruction: `Extract comprehensive information about ${topic} from this page`
    });
    allContent.push(content);
  }
  
  // Synthesize comprehensive research
  return synthesizeResearch(allContent);
};

Research Paper Analysis

// Extract and analyze research papers
const analyzeResearchPaper = async (paperUrl) => {
  await browser_navigate({ url: paperUrl });
  
  // Extract paper structure
  const paperStructure = await browser_extract({
    instruction: "Extract the paper's title, authors, abstract, methodology, results, and conclusions"
  });
  
  // Save for detailed analysis
  await browser_pdf_save({ name: "research_paper_analysis" });
  
  // Extract key findings
  const keyFindings = await browser_extract({
    instruction: "Extract the most important findings, performance metrics, and implementation insights from this research"
  });
  
  return { paperStructure, keyFindings };
};

Content Validation and Verification

Cross-Source Validation

// Validate information across multiple sources
const validateInformation = async (topic, sources) => {
  const validations = [];
  
  for (const source of sources) {
    await browser_navigate({ url: source });
    const information = await browser_extract({
      instruction: `Extract information about ${topic} and note the source credibility and publication date`
    });
    
    validations.push({
      source: source,
      information: information,
      credibility: assessCredibility(source),
      recency: assessRecency(information)
    });
  }
  
  return crossValidateInformation(validations);
};

Implementation Example Extraction

// Extract practical implementation examples
const extractImplementationExamples = async (documentationUrl) => {
  await browser_navigate({ url: documentationUrl });
  
  const examples = await browser_extract({
    instruction: "Extract all code examples, implementation patterns, and practical usage examples from this documentation"
  });
  
  // Validate examples work with current APIs
  const validatedExamples = await validateExamples(examples);
  
  return validatedExamples;
};

πŸ”§ Troubleshooting Playwright Issues

Common Issues and Solutions

Content Not Extracted Completely

  • Refine Instructions: Make extraction instructions more specific
  • Use Multiple Queries: Break complex extractions into smaller parts
  • Check Page Loading: Ensure page is fully loaded before extraction
  • Handle Dynamic Content: Wait for JavaScript-generated content

Navigation Failures

  • Check URLs: Verify URLs are correct and accessible
  • Handle Redirects: Account for URL redirects and changes
  • Network Issues: Implement retry logic for network failures
  • Access Restrictions: Handle paywalls or access restrictions

Content Quality Issues

  • Source Validation: Verify source credibility and recency
  • Cross-Reference: Validate information against multiple sources
  • Update Checks: Check for more recent versions of documentation
  • Context Preservation: Ensure extracted content maintains proper context

Performance Optimization

  • Batch Processing: Group related extractions to reduce overhead
  • Selective Extraction: Extract only needed content to improve speed
  • Caching Strategy: Cache frequently accessed content
  • Parallel Processing: Extract from multiple sources simultaneously

πŸ“Š Measuring Playwright Effectiveness

Success Metrics

Quantitative Metrics

  • Extraction Completeness: Percentage of target content successfully extracted
  • Research Efficiency: Time saved vs. manual research methods
  • Source Coverage: Number of sources analyzed per research topic
  • Content Quality: Accuracy and usefulness of extracted content

Qualitative Metrics

  • Research Depth: Ability to gather comprehensive information
  • Implementation Value: Practicality of extracted examples
  • Source Quality: Credibility and recency of information sources
  • Synthesis Quality: Effectiveness of multi-source information synthesis

Research Quality Framework

Source Credibility Hierarchy

  1. Official Documentation: Highest credibility for framework-specific information
  2. Peer-Reviewed Papers: Highest credibility for research and technical analysis
  3. Industry Blogs: Medium credibility with expert authors and companies
  4. Community Forums: Useful for practical problems and real-world experience
  5. Social Media: Lowest credibility, use for trend identification only

Content Recency Assessment

  • Current Documentation: Official docs updated within last 6 months
  • Recent Research: Papers published within last 12 months
  • Industry Content: Articles from last 3-6 months
  • Community Content: Recent discussions and solutions

πŸ“š Playwright Best Practices

Extraction Strategy Guidelines

  1. Be Specific: Use detailed, specific extraction instructions
  2. Context Matters: Include context about what and why you're extracting
  3. Quality Over Quantity: Focus on high-quality, relevant content
  4. Source Diversity: Extract from multiple types of sources
  5. Validation First: Verify source credibility before deep extraction

Research Workflow Integration

  1. Plan Extraction Scope: Define what content is needed upfront
  2. Source Prioritization: Start with highest credibility sources
  3. Iterative Refinement: Refine extraction based on initial results
  4. Content Synthesis: Combine information from multiple sources
  5. Validation and Verification: Cross-check information accuracy

Performance and Efficiency Practices

  1. Strategic Extraction: Extract only what's needed for the research goal
  2. Batch Processing: Group related extractions for efficiency
  3. Content Caching: Reuse extracted content for similar research
  4. Quality Monitoring: Track extraction quality and success rates

🎯 Playwright Integration Roadmap

Immediate Actions (Week 1-2)

  • Increase usage for comprehensive documentation extraction
  • Develop extraction templates for common research scenarios
  • Implement content validation workflows
  • Create research synthesis automation

Medium-term Goals (Month 1-3)

  • Achieve 95%+ extraction completeness for target content
  • Integrate Playwright into all major research workflows
  • Build library of successful extraction patterns
  • Automate multi-source research synthesis

Long-term Vision (Month 3+)

  • Playwright as primary tool for comprehensive web research
  • Automated research report generation from web sources
  • Advanced content analysis and pattern recognition
  • Community sharing of successful research extraction patterns

πŸ“ˆ Research Impact Tracking

Weekly Research Review

## Weekly Playwright Research Review

**Period**: [Date Range]
**Extractions Conducted**: [Number]
**Average Completeness Score**: [1-10]
**Sources Analyzed**: [Number]
**Time Saved**: [Hours]
**Research Quality Score**: [1-10]

### Top Research Findings
1. **[Topic]**: [Key insights from extracted content]
2. **[Topic]**: [Key insights from extracted content]

### Extraction Efficiency Improvements
- **Query Optimization**: [Successful extraction patterns]
- **Source Quality**: [High-value sources identified]
- **Synthesis Process**: [Improved multi-source integration]

### Next Week Focus
- **Priority Research Areas**: [Based on project needs]
- **Extraction Optimization**: [Patterns to test and refine]
- **Integration Improvements**: [Workflow enhancements]

This guide optimizes Playwright usage for web research and documentation extraction. For GitHub MCP repository management, see docs/mcp-integration/github-mcp-guide.md. For general MCP integration, see docs/mcp-integration/mcp-optimization.md.