web-content-extraction

Here are 2 public repositories matching this topic...

Murrough-Foley / web-content-extraction-benchmark

WCXB: Web Content Extraction Benchmark — 2,008 pages, 7 page types, 1,613 domains. The largest open benchmark for web content extraction, boilerplate removal, and main content detection.

nlp benchmark text-extraction dataset web-scraping html-parsing content-extraction boilerplate-removal web-content-extraction

Updated Apr 4, 2026
Python

facsimiles / beautifulsoup

🌐 BeautifulSoup: Effortlessly scrape and parse web data with this powerful Python library! Perfect for developers needing quick and reliable HTML/XML data extraction. Start saving time on your projects today! [MIRROR][UNOFFICIAL]

python data-mining mirror web-crawler python3 unofficial web-scraping xpath data-extraction html-parsing css-selectors web-automation mirrored-repository unofficial-mirror dynamic-web-scraping api-scraping web-content-extraction

Updated Mar 1, 2026
Python
