WCXB: Web Content Extraction Benchmark — 2,008 pages, 7 page types, 1,613 domains. The largest open benchmark for web content extraction, boilerplate removal, and main content detection.
Updated Apr 4, 2026 - Python
BeautifulSoup: a Python library for scraping and parsing web data, suited to quick and reliable HTML/XML data extraction. [MIRROR][UNOFFICIAL]