🌐 Web-Use


Web-Use is an intelligent autonomous browsing agent, built to seamlessly navigate websites, interact with dynamic content, perform smart searches, download files, and adapt to ever-changing pages, all with minimal effort from you. Powered by advanced LLMs and the Chrome DevTools Protocol (CDP), it transforms complex web tasks into streamlined, automated workflows that boost productivity and save time.

✨ Key Features

  • 🤖 Autonomous Web Navigation: Navigate websites, fill forms, and interact with dynamic content without manual intervention
  • 🛠️ Multi-LLM Support: Works with Anthropic Claude, Google Gemini, OpenAI, Groq, Ollama, Cerebras, Mistral, and more
  • 📸 Vision Capability: Understands visual content on pages with scroll-aware bounding boxes for accurate element highlighting
  • 🌳 Semantic Tree: DOM traversal-based tree showing real page structure with roles, ids, classes, and text content
  • 🔗 Web Model Context Protocol (WebMCP): Discovers and uses custom tools exposed by websites
  • ⚡ Efficient Element Interaction: Indexed DOM elements for fast, accurate clicking and typing
  • 📥 File Operations: Download files and upload content to forms
  • 🔄 State Awareness: Maintains understanding of page state to avoid loops and recover from errors
  • ⏱️ Intelligent Waiting: Handles loading states, animations, and user interactions (CAPTCHA, OTP)
  • 🔐 OAuth 2.0 + PKCE: Built-in authenticated workflows for OAuth-protected services with persistent token storage

🌳 Semantic Tree

Web-Use builds a semantic tree of the visible page directly from the real DOM parent-child relationships captured via CDP, not reconstructed from XPaths. This gives the agent accurate structural context around every element.

Each node in the tree is rendered with CSS selector notation showing tag, id, class, and role:

document  [role: document]
└── nav#main-nav.navbar
    ├── [#0] a.nav-link "Home"  → /
    ├── [#1] a.nav-link "About"  → /about
    └── [#2] div.dropdown [button] "Products"
form#checkout-form
├── p.hint  "Fill in your details below"
├── [#3] input#email.form-input "Email"
├── [#4] input#name.form-input "Name"
└── [#5] div.btn-group [button] "Submit"

What's included:

  • Interactive elements (buttons, links, inputs, selects, checkboxes, anything clickable), labelled [#id]
  • Informative elements: headings, paragraphs, list items, labels, table cells, blockquotes, figcaptions, and more
  • Structural containers (nav, header, footer, main, section, form, ul, aside, dialog, etc.), shown as grouping context
  • Roles shown in [brackets] when they differ from the tag (e.g. div [button], span [link])
  • Text content extracted correctly even when wrapped in inline elements (em, strong, span, a, etc.)
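
The notation described above can be illustrated with a small sketch. It is not Web-Use's actual renderer: the `Node` class, the `render_label` function, and the `IMPLICIT_ROLES` table are hypothetical names introduced here, and the sketch covers only the label of a single node (index, selector, role, text), not the tree drawing or link targets.

```python
from dataclasses import dataclass, field
from typing import Optional

# Illustrative mapping from tag to its implicit role (an assumption, not Web-Use's table).
# A role is printed only when it differs from what the tag already implies.
IMPLICIT_ROLES = {
    "a": "link",
    "button": "button",
    "input": "textbox",
    "nav": "navigation",
    "form": "form",
}


@dataclass
class Node:
    tag: str
    id: str = ""
    classes: list[str] = field(default_factory=list)
    role: Optional[str] = None
    text: str = ""
    index: Optional[int] = None  # set only for interactive elements


def render_label(node: Node) -> str:
    """Build a label such as: [#2] div.dropdown [button] "Products"."""
    selector = node.tag
    if node.id:
        selector += f"#{node.id}"
    selector += "".join(f".{name}" for name in node.classes)

    parts = []
    if node.index is not None:
        parts.append(f"[#{node.index}]")
    parts.append(selector)
    # Show the role in brackets only when the tag does not already imply it.
    if node.role and node.role != IMPLICIT_ROLES.get(node.tag, node.tag):
        parts.append(f"[{node.role}]")
    if node.text:
        parts.append(f'"{node.text}"')
    return " ".join(parts)


if __name__ == "__main__":
    print(render_label(Node("div", classes=["dropdown"], role="button", text="Products", index=2)))
    print(render_label(Node("input", id="email", classes=["form-input"], role="textbox", text="Email", index=3)))
```

Keeping the role out of the label when the tag already implies it keeps the tree compact, which matters when the whole tree is sent to an LLM on every step.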

🔐 OAuth 2.0 + PKCE

Web-Use has built-in support for the OAuth 2.0 Authorization Code flow with PKCE, enabling the agent to authenticate with any OAuth provider (Google, GitHub, Microsoft, etc.) without storing passwords.

How it works

  1. A local HTTP server starts on localhost:PORT
  2. The browser navigates to the provider's login page
  3. The user logs in once, and the provider redirects back with an authorization code
  4. The code is exchanged for tokens using the PKCE verifier
  5. Authorization: Bearer <token> is injected into every browser request automatically
  6. Tokens are saved to ~/.web-use/oauth/ and reloaded on future runs, so no further login is required
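
Steps 2 and 4 depend on a PKCE verifier/challenge pair. The sketch below shows how such a pair can be generated with the S256 method from RFC 7636 and attached to an authorization URL, using only the Python standard library. The function names (`challenge_for`, `make_pkce_pair`, `build_authorization_url`) are hypothetical and are not part of Web-Use's API; how Web-Use implements this internally is not shown in this README.

```python
import base64
import hashlib
import secrets
from urllib.parse import urlencode


def challenge_for(verifier: str) -> str:
    """Derive the S256 code challenge: base64url(sha256(verifier)) without padding."""
    digest = hashlib.sha256(verifier.encode("ascii")).digest()
    return base64.urlsafe_b64encode(digest).rstrip(b"=").decode("ascii")


def make_pkce_pair() -> tuple[str, str]:
    """Return (code_verifier, code_challenge)."""
    # 64 random bytes encode to 86 URL-safe characters, inside the 43-128 range RFC 7636 allows.
    verifier = secrets.token_urlsafe(64)
    return verifier, challenge_for(verifier)


def build_authorization_url(auth_url, client_id, redirect_uri, scopes, challenge, state):
    """Assemble the URL the browser opens in step 2."""
    params = {
        "response_type": "code",
        "client_id": client_id,
        "redirect_uri": redirect_uri,
        "scope": " ".join(scopes),
        "state": state,  # random value echoed back by the provider, checked on the callback
        "code_challenge": challenge,
        "code_challenge_method": "S256",
    }
    return f"{auth_url}?{urlencode(params)}"


if __name__ == "__main__":
    verifier, challenge = make_pkce_pair()
    print(build_authorization_url(
        "https://accounts.google.com/o/oauth2/v2/auth",
        "example-client-id",  # placeholder, not a real client id
        "http://localhost:8765/callback",
        ["openid", "email", "profile"],
        challenge,
        secrets.token_urlsafe(16),
    ))
```

The verifier stays on the local machine and is sent only in the token exchange (step 4), which is why a client secret is not needed for this flow.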

Usage

import asyncio
import os
from src.agent.auth import OAuthConfig

oauth_config = OAuthConfig(
    client_id=os.getenv('OAUTH_CLIENT_ID'),
    auth_url='https://accounts.google.com/o/oauth2/v2/auth',
    token_url='https://oauth2.googleapis.com/token',
    scopes=['openid', 'email', 'profile'],
    redirect_uri='http://localhost:8765/callback',
)

async def setup_auth():
    # `agent` is an Agent instance created as shown in Basic Setup
    await agent.browser.ensure_open()
    # Load the saved token (silently refreshed if expired)
    token = await agent.browser.oauth.load(oauth_config)
    if token is None:
        # First run: opens the login page so the user can authenticate once
        token = await agent.browser.oauth.authorize(oauth_config)

asyncio.run(setup_auth())

First run: login page opens, user authenticates, token saved.
Every run after: token loaded from disk and refreshed silently if needed, with no user interaction.
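
The load-then-refresh behaviour can be pictured with a minimal sketch of token persistence. Everything here is an assumption for illustration: the JSON layout, the `expires_at` field, and the helper names `save_token`, `load_token`, and `needs_refresh` are hypothetical, and the real on-disk format under ~/.web-use/oauth/ may differ.

```python
import json
import tempfile
import time
from pathlib import Path
from typing import Optional


def save_token(path: Path, token: dict) -> None:
    """Write the token as JSON, creating the parent directory if needed."""
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(token))


def load_token(path: Path) -> Optional[dict]:
    """Return the stored token, or None when nothing has been saved yet."""
    if not path.exists():
        return None
    return json.loads(path.read_text())


def needs_refresh(token: dict, now: Optional[float] = None, skew: float = 60.0) -> bool:
    """True when the access token expires within `skew` seconds of `now`."""
    current = time.time() if now is None else now
    return current >= token["expires_at"] - skew


if __name__ == "__main__":
    # Use a temporary directory so the demo never touches real credentials.
    with tempfile.TemporaryDirectory() as tmp:
        token_file = Path(tmp) / "oauth" / "example.json"
        print(load_token(token_file))  # nothing saved yet
        save_token(token_file, {"access_token": "abc", "expires_at": time.time() + 3600})
        print(needs_refresh(load_token(token_file)))
```

The small `skew` margin refreshes a token shortly before it expires, so a request started just before expiry does not fail halfway through.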

To clear saved tokens:

await agent.browser.oauth.revoke()

🌐 Web Model Context Protocol (WebMCP)

Web-Use supports WebMCP, a protocol that allows websites to expose custom tools and capabilities directly to the agent. When visiting a website with WebMCP support:

  • Auto-Discovery: The agent automatically detects available tools
  • Dynamic Registration: Tools are added to the agent's toolkit on-the-fly
  • Full Integration: WebMCP tools appear in the browser state with complete schema information
  • Seamless Execution: Tools are called like built-in tools with proper parameter validation

Example

If you visit a documentation site that supports WebMCP with a search_docs tool:

**WebMCP Tools Available:**
**search_docs**: Search documentation
  - `query` (string) [✓ required]
  - `limit` (integer) [○ optional]

The agent will automatically use this tool when relevant to the task.
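
To make "proper parameter validation" concrete, here is a minimal sketch that checks a call against the search_docs parameters listed above. The schema dictionary and the `validate_arguments` function are hypothetical; they only illustrate the idea and do not reflect how Web-Use or WebMCP represent tool schemas.

```python
from typing import Any

# Hypothetical in-memory description of the search_docs tool shown above.
SEARCH_DOCS_SCHEMA = {
    "query": {"type": str, "required": True},
    "limit": {"type": int, "required": False},
}


def validate_arguments(schema: dict[str, dict[str, Any]], arguments: dict[str, Any]) -> list[str]:
    """Return a list of problems; an empty list means the call is valid."""
    errors = []
    for name, spec in schema.items():
        if name not in arguments:
            if spec["required"]:
                errors.append(f"missing required parameter: {name}")
            continue
        value = arguments[name]
        expected = spec["type"]
        # bool is a subclass of int in Python, so reject it explicitly for integer parameters.
        if not isinstance(value, expected) or (expected is int and isinstance(value, bool)):
            errors.append(f"parameter {name} must be of type {expected.__name__}")
    for name in arguments:
        if name not in schema:
            errors.append(f"unknown parameter: {name}")
    return errors


if __name__ == "__main__":
    print(validate_arguments(SEARCH_DOCS_SCHEMA, {"query": "oauth", "limit": 5}))
    print(validate_arguments(SEARCH_DOCS_SCHEMA, {"limit": "5"}))
```

Returning every problem at once, rather than raising on the first, gives the LLM enough detail to correct the whole call in a single retry.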

Enable WebMCP support:

agent = Agent(
    config=config,
    llm=llm,
    use_web_mcp=True,
    max_steps=100
)

🛠️ Installation Guide

Prerequisites

  • Python 3.11 or higher
  • UV

Installation Steps

Clone the repository:

git clone https://github.com/CursorTouch/Web-Use.git
cd Web-Use

Install dependencies:

uv sync

Setting up the .env file:

GOOGLE_API_KEY="<API_KEY_HERE>"

Basic Setup:

from src.agent.browser.config import BrowserConfig
from src.providers.ollama import ChatOllama
from src.agent import Agent
from dotenv import load_dotenv

load_dotenv()

llm = ChatOllama(model='qwen3.5:397b-cloud', temperature=0.5)

config = BrowserConfig(
    browser='chrome',
    headless=False,
    use_system_profile=True
)

agent = Agent(
    config=config,
    llm=llm,
    use_vision=True,
    use_web_mcp=True,
    max_steps=100
)

user_query = input('Enter your query: ')
agent.print_response(user_query)

Execute:

uv run main.py

⚙️ Configuration Options

Agent Parameters

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| config | BrowserConfig | Required | Browser configuration |
| llm | BaseChatLLM | Required | Language model for reasoning |
| use_vision | bool | False | Enable screenshot-based visual understanding |
| use_web_mcp | bool | False | Enable WebMCP tool discovery |
| max_steps | int | 25 | Maximum actions before timeout |
| max_consecutive_failures | int | 3 | Retry limit for failed tool calls |
| include_human_in_loop | bool | False | Allow pausing for human input |
| keep_alive | bool | False | Keep browser open after task completion |

Browser Configuration

config = BrowserConfig(
    browser='chrome',               # 'chrome' or 'edge'
    headless=False,                 # Run in headless mode
    use_system_profile=True,        # Use real browser profile with auth
    user_data_dir='/path/to/profile',   # Custom profile directory
    cdp_port=9222,                  # Chrome DevTools Protocol port
    downloads_dir='/Downloads',     # Where to save files
    attach_to_existing=False,       # Connect to running browser
    update_cdp=False,               # Regenerate CDP protocol files
)

OAuth Configuration

from src.agent.auth import OAuthConfig

config = OAuthConfig(
    client_id='your-client-id',     # From your OAuth app registration
    auth_url='https://...',         # Provider authorization endpoint
    token_url='https://...',        # Provider token endpoint
    scopes=['openid', 'email'],     # Requested OAuth scopes
    redirect_uri='http://localhost:8765/callback',  # Must match app registration
    client_secret=None,             # Optional; not needed with PKCE
)

🎥 Demos

Prompt: I want to know the price details of the RTX 4060 laptop gpu from various sellers from amazon.in

Amazon.mov

Prompt: Make a twitter post about AI on X

Twitter.mov

Prompt: Can you play the trailer of GTA 6 on youtube

Youtube.mov

Prompt: Can you go to my github account and visit the Windows MCP

Github.mov

🪪 License

This project is licensed under the MIT License. See the LICENSE file for details.

🀝 Contributing

Contributions are welcome! Please see CONTRIBUTING for setup instructions and development guidelines.

Made with ❤️ by Jeomon George, Muhammad Yaseen


📒 References