docs: Add CRW document loader integration#3353
docs: Add CRW document loader integration#3353Recep S (us) wants to merge 2 commits intolangchain-ai:mainfrom
Conversation
There was a problem hiding this comment.
Pull request overview
Adds LangChain Python documentation for the CRW web-scraper integration, including a provider page and a document loader page with setup, usage patterns, and migration guidance.
Changes:
- Added CRW provider documentation (installation + self-hosted/cloud setup).
- Added CRW document loader documentation (modes, examples, RAG pipeline, FireCrawl migration).
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 6 comments.
| File | Description |
|---|---|
| src/oss/python/integrations/providers/crw.mdx | New provider page documenting CRW installation and configuration plus a link to loader docs. |
| src/oss/python/integrations/document_loaders/crw.mdx | New document loader page documenting CrwLoader usage, modes, options, RAG example, and migration from FireCrawl. |
|
|
||
| ## Document loader | ||
|
|
||
| See a [usage example](/oss/integrations/document_loaders/crw). |
There was a problem hiding this comment.
The link path looks inconsistent with this doc’s location under /oss/python/... and may 404 (it omits /python). Update the URL to match the site’s routing for Python integrations (e.g., include the /python segment if that’s the established pattern in this repo).
| See a [usage example](/oss/integrations/document_loaders/crw). | |
| See a [usage example](/oss/python/integrations/document_loaders/crw). |
| description: "Integrate with the CRW document loader using LangChain Python." | ||
| --- | ||
|
|
||
| [CRW](https://github.com/us/crw) is an open-source, high-performance web scraper built for AI agents. Written in Rust, it crawls and converts any website into clean markdown with JS rendering support. Self-hosted (free) or managed cloud at [fastcrw.com](https://fastcrw.com). |
There was a problem hiding this comment.
The page text states CRW has JS rendering support and the examples include render_js, but the “JS support” column is set to ❌. Either update the table to ✅ or adjust the description/examples to avoid contradicting the feature matrix.
|
|
||
| | Class | Package | Local | Serializable | JS support | | ||
| | :--- | :--- | :---: | :---: | :---: | | ||
| | CrwLoader | [langchain-crw](https://pypi.org/project/langchain-crw/) | ✅ | ❌ | ❌ | |
There was a problem hiding this comment.
The page text states CRW has JS rendering support and the examples include render_js, but the “JS support” column is set to ❌. Either update the table to ✅ or adjust the description/examples to avoid contradicting the feature matrix.
| | CrwLoader | [langchain-crw](https://pypi.org/project/langchain-crw/) | ✅ | ❌ | ❌ | | |
| | CrwLoader | [langchain-crw](https://pypi.org/project/langchain-crw/) | ✅ | ❌ | ✅ | |
| pages = [] | ||
| for doc in loader.lazy_load(): | ||
| pages.append(doc) | ||
| if len(pages) >= 10: | ||
| # do some paged operation, e.g. | ||
| # index.upsert(pages) | ||
| pages = [] |
There was a problem hiding this comment.
This snippet references loader but doesn’t define it in the same code block, so it fails if copied as-is. Include the CrwLoader(...) initialization in this block or explicitly note that it continues from the prior example.
| Run CRW locally — no API key, no account, no limits: | ||
|
|
||
| ```bash | ||
| curl -fsSL https://raw.githubusercontent.com/us/crw/main/install.sh | bash |
There was a problem hiding this comment.
Piping a remote script directly into bash is unsafe (users can’t easily verify what they’re executing). Prefer documenting a safer flow (download to a file, inspect, then execute) or link to CRW’s official installation instructions that provide verification steps.
| curl -fsSL https://raw.githubusercontent.com/us/crw/main/install.sh | bash | |
| # Download the installer script | |
| curl -fsSL https://raw.githubusercontent.com/us/crw/main/install.sh -o crw_install.sh | |
| # (Optional) Inspect the script before running it | |
| cat crw_install.sh | |
| # Run the installer | |
| bash crw_install.sh | |
| # Start CRW |
| --- | ||
|
|
||
| >[CRW](https://github.com/us/crw) is an open-source, high-performance web scraper built for AI agents. | ||
| > Written in Rust, it crawls and converts any website into clean markdown. |
There was a problem hiding this comment.
Trailing whitespace at end of line. Remove it to avoid noisy diffs and markdown lint issues.
| > Written in Rust, it crawls and converts any website into clean markdown. | |
| > Written in Rust, it crawls and converts any website into clean markdown. |
Summary
Adds documentation for CRW, an open-source web scraper built for AI agents, as a LangChain document loader integration.
Files added:
src/oss/python/integrations/providers/crw.mdx— Provider page with installation and setupsrc/oss/python/integrations/document_loaders/crw.mdx— Document loader page with usage examplesPyPI package:
langchain-crwWhat CRW provides
CrwLoader— Document loader with three modes:scrape,crawl, andmapFireCrawlLoaderDocumentation follows existing patterns
Modeled after the existing FireCrawl integration docs, including: