# Development Guide

This guide provides instructions for setting up a development environment, running tests, and contributing to the scraper project.
## Prerequisites

- Python 3.11 or higher
- Docker and Docker Compose (for integration testing)
- Git
## Setup

- Clone the repository:

  ```bash
  git clone https://github.com/spiralhouse/scraper.git
  cd scraper
  ```

- Create and activate a virtual environment:

  ```bash
  python -m venv venv
  source venv/bin/activate  # On Windows: venv\Scripts\activate
  ```

- Install development dependencies:

  ```bash
  pip install -r requirements-dev.txt
  pip install -r requirements.txt
  ```

## Running Tests

To run all unit tests:
```bash
pytest
```

To run tests with coverage reporting:

```bash
pytest --cov=scraper --cov-report=term-missing
```

To run a specific test file:

```bash
pytest tests/test_crawler.py
```

## Integration Testing

The project includes a Docker-based test environment that generates a controlled website for testing.
- Generate the test site:

  ```bash
  python generate_test_site.py
  ```

- Start the test environment:

  ```bash
  docker-compose up -d
  ```

- Run the scraper against the test site:

  ```bash
  python main.py http://localhost:8080 --depth 2
  ```

- Stop the test environment when done:

  ```bash
  docker-compose down
  ```

If Docker is unavailable, you can use the Python-based test server:
```bash
python serve_test_site.py
```

This will start a local HTTP server on port 8080 serving the same test site.
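The implementation of `serve_test_site.py` is not shown in this guide, but a fallback server like it can be built entirely from the standard library. The sketch below is a hypothetical equivalent, not the project's actual code; in particular, the `test_site` directory name is an assumption:

```python
# Minimal stdlib-only sketch of what serve_test_site.py might do.
# NOTE: the "test_site" directory name is an assumption, not confirmed
# by this guide; substitute whatever generate_test_site.py produces.
import functools
from http.server import SimpleHTTPRequestHandler, ThreadingHTTPServer


def make_server(directory: str = "test_site", port: int = 8080) -> ThreadingHTTPServer:
    """Build an HTTP server that serves `directory` on localhost:`port`."""
    # SimpleHTTPRequestHandler accepts a `directory` keyword (Python 3.7+),
    # so partial() binds it without subclassing.
    handler = functools.partial(SimpleHTTPRequestHandler, directory=directory)
    return ThreadingHTTPServer(("", port), handler)


# Calling make_server().serve_forever() would serve the site on
# http://localhost:8080 until interrupted with Ctrl+C.
```

Using `ThreadingHTTPServer` rather than the single-threaded `HTTPServer` lets the crawler fetch several pages concurrently without the server becoming the bottleneck.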
## Code Quality

To check code quality with flake8:

```bash
flake8 scraper tests
```

To run type checking with mypy:

```bash
mypy scraper
```

To format code with black:

```bash
black scraper tests
```

## Debugging

To enable verbose logging:

```bash
python main.py https://example.com -v
```

## Profiling

To profile the crawler's performance:
```bash
python -m cProfile -o crawler.prof main.py https://example.com --depth 1
python -c "import pstats; p = pstats.Stats('crawler.prof'); p.sort_stats('cumtime').print_stats(30)"
```

## Improving Test Coverage

Current test coverage is monitored through CI and displayed as a badge in the README. To increase coverage:
- Check current coverage gaps:

  ```bash
  pytest --cov=scraper --cov-report=term-missing
  ```

- Target untested functions or code paths with new tests
- Verify coverage improvement after adding tests
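A coverage-driven test usually exercises one previously untested path directly. The example below is purely illustrative: `normalize_url` is a hypothetical stand-in for whatever uncovered function the coverage report points you at, not the project's real API:

```python
# Hypothetical example of targeting an uncovered code path with a test.
# `normalize_url` is an invented stand-in, not the scraper's actual API.
from urllib.parse import urljoin


def normalize_url(base: str, link: str) -> str:
    """Resolve a possibly-relative link against the page it appeared on."""
    return urljoin(base, link)


def test_normalize_url_resolves_parent_relative_links():
    # Edge case a coverage report might flag: "../" path traversal.
    result = normalize_url("http://localhost:8080/a/", "../b.html")
    assert result == "http://localhost:8080/b.html"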
## Project Structure

```
scraper/                    # Main package directory
├── __init__.py             # Package initialization
├── cache_manager.py        # Cache implementation
├── callbacks.py            # Callback functions for crawled pages
├── crawler.py              # Main crawler class
├── request_handler.py      # HTTP request/response handling
├── response_parser.py      # HTML parsing and link extraction
├── robots_parser.py        # robots.txt parsing and checking
└── sitemap_parser.py       # sitemap.xml parsing
tests/                      # Test suite
├── __init__.py
├── conftest.py             # pytest fixtures
├── test_cache.py           # Tests for cache_manager.py
├── test_crawler.py         # Tests for crawler.py
├── test_request_handler.py
├── test_response_parser.py
├── test_robots_parser.py
└── test_sitemap_parser.py
docs/                       # Documentation
├── project.md              # Project overview and features
└── develop.md              # Development guide
.github/workflows/          # CI configuration
```
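To give a feel for what lives in a module like `response_parser.py` ("HTML parsing and link extraction"), here is a hypothetical, stdlib-only sketch of a link extractor. The class and function names are invented for illustration; the real module may be structured quite differently:

```python
# Hypothetical sketch of the kind of logic response_parser.py covers:
# extracting href values from <a> tags using only the standard library.
# LinkExtractor and extract_links are invented names, not project code.
from html.parser import HTMLParser


class LinkExtractor(HTMLParser):
    """Collect href attributes from anchor tags as the parser streams HTML."""

    def __init__(self) -> None:
        super().__init__()
        self.links: list[str] = []

    def handle_starttag(self, tag: str, attrs: list) -> None:
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def extract_links(html: str) -> list[str]:
    """Return every href found in `html`, in document order."""
    parser = LinkExtractor()
    parser.feed(html)
    return parser.links
```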
## Contributing

- Create a new branch for your feature or bugfix
- Implement your changes with appropriate tests
- Ensure all tests pass and coverage doesn't decrease
- Submit a pull request with a clear description of the changes
## Code Style

- Follow PEP 8 style guidelines
- Include docstrings for all functions, classes, and modules
- Add type hints to function signatures
- Keep functions focused on a single responsibility
- Write tests for all new functionality
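Taken together, the guidelines above look like this in practice. The function is illustrative only, written for this guide rather than taken from the project:

```python
# Illustrative only: a function written to the style guidelines above
# (PEP 8 naming, a full docstring, type hints, single responsibility).
def count_internal_links(links: list[str], host: str) -> int:
    """Count links that point at `host` or are host-relative.

    Args:
        links: Candidate link URLs, absolute or relative.
        host: The host treated as "internal", e.g. "localhost:8080".

    Returns:
        The number of internal links.
    """
    internal = 0
    for link in links:
        if link.startswith("/") or f"//{host}" in link:
            internal += 1
    return internal
```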