# Development Guide

This guide provides instructions for setting up a development environment, running tests, and contributing to the scraper project.
## Prerequisites

- Python 3.11 or higher
- Docker and Docker Compose (for integration testing)
- Git
## Setup

- Clone the repository:

  ```bash
  git clone https://github.com/spiralhouse/scraper.git
  cd scraper
  ```

- Create and activate a virtual environment:

  ```bash
  python -m venv venv
  source venv/bin/activate  # On Windows: venv\Scripts\activate
  ```

- Install development dependencies:

  ```bash
  pip install -r requirements-dev.txt
  pip install -r requirements.txt
  ```

## Running Tests

To run all unit tests:
```bash
pytest
```

To run tests with coverage reporting:

```bash
pytest --cov=scraper --cov-report=term-missing
```

To run a specific test file:

```bash
pytest tests/test_crawler.py
```

## Integration Testing

The project includes a Docker-based test environment that generates a controlled website for testing.
- Generate the test site:

  ```bash
  python generate_test_site.py
  ```

- Start the test environment:

  ```bash
  docker-compose up -d
  ```

- Run the scraper against the test site:

  ```bash
  python main.py http://localhost:8080 --depth 2
  ```

- Stop the test environment when done:

  ```bash
  docker-compose down
  ```

If Docker is unavailable, you can use the Python-based test server:
```bash
python serve_test_site.py
```

This will start a local HTTP server on port 8080 serving the same test site.
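The implementation of `serve_test_site.py` is not shown in this guide, but a fallback server like it can be built entirely from the standard library. The sketch below is a hypothetical equivalent, not the project's actual code; in particular, the `test_site` directory name is an assumption:

```python
# Minimal stdlib-only sketch of what serve_test_site.py might do.
# NOTE: the "test_site" directory name is an assumption, not confirmed
# by this guide; substitute whatever generate_test_site.py produces.
import functools
from http.server import SimpleHTTPRequestHandler, ThreadingHTTPServer


def make_server(directory: str = "test_site", port: int = 8080) -> ThreadingHTTPServer:
    """Build an HTTP server that serves `directory` on localhost:`port`."""
    # SimpleHTTPRequestHandler accepts a `directory` keyword (Python 3.7+),
    # so partial() binds it without subclassing.
    handler = functools.partial(SimpleHTTPRequestHandler, directory=directory)
    return ThreadingHTTPServer(("", port), handler)


# Calling make_server().serve_forever() would serve the site on
# http://localhost:8080 until interrupted with Ctrl+C.
```

Using `ThreadingHTTPServer` rather than the single-threaded `HTTPServer` lets the crawler fetch several pages concurrently without the server becoming the bottleneck.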
## Code Quality

To check code quality with flake8:

```bash
flake8 scraper tests
```

To run type checking with mypy:

```bash
mypy scraper
```

To format code with black:

```bash
black scraper tests
```

## Debugging

To enable verbose logging:

```bash
python main.py https://example.com -v
```

## Profiling

To profile the crawler's performance:
```bash
python -m cProfile -o crawler.prof main.py https://example.com --depth 1
python -c "import pstats; p = pstats.Stats('crawler.prof'); p.sort_stats('cumtime').print_stats(30)"
```

## Improving Test Coverage

Current test coverage is monitored through CI and displayed as a badge in the README. To increase coverage:
- Check current coverage gaps:

  ```bash
  pytest --cov=scraper --cov-report=term-missing
  ```

- Target untested functions or code paths with new tests
- Verify coverage improvement after adding tests
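A coverage-driven test usually exercises one previously untested path directly. The example below is purely illustrative: `normalize_url` is a hypothetical stand-in for whatever uncovered function the coverage report points you at, not the project's real API:

```python
# Hypothetical example of targeting an uncovered code path with a test.
# `normalize_url` is an invented stand-in, not the scraper's actual API.
from urllib.parse import urljoin


def normalize_url(base: str, link: str) -> str:
    """Resolve a possibly-relative link against the page it appeared on."""
    return urljoin(base, link)


def test_normalize_url_resolves_parent_relative_links():
    # Edge case a coverage report might flag: "../" path traversal.
    result = normalize_url("http://localhost:8080/a/", "../b.html")
    assert result == "http://localhost:8080/b.html"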
## Project Structure

```
scraper/                    # Main package directory
├── __init__.py             # Package initialization
├── cache_manager.py        # Cache implementation
├── callbacks.py            # Callback functions for crawled pages
├── crawler.py              # Main crawler class
├── request_handler.py      # HTTP request/response handling
├── response_parser.py      # HTML parsing and link extraction
├── robots_parser.py        # robots.txt parsing and checking
└── sitemap_parser.py       # sitemap.xml parsing
tests/                      # Test suite
├── __init__.py
├── conftest.py             # pytest fixtures
├── test_cache.py           # Tests for cache_manager.py
├── test_crawler.py         # Tests for crawler.py
├── test_request_handler.py
├── test_response_parser.py
├── test_robots_parser.py
└── test_sitemap_parser.py
docs/                       # Documentation
├── project.md              # Project overview and features
└── develop.md              # Development guide
.github/workflows/          # CI configuration
```
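To give a feel for what lives in a module like `response_parser.py` ("HTML parsing and link extraction"), here is a hypothetical, stdlib-only sketch of a link extractor. The class and function names are invented for illustration; the real module may be structured quite differently:

```python
# Hypothetical sketch of the kind of logic response_parser.py covers:
# extracting href values from <a> tags using only the standard library.
# LinkExtractor and extract_links are invented names, not project code.
from html.parser import HTMLParser


class LinkExtractor(HTMLParser):
    """Collect href attributes from anchor tags as the parser streams HTML."""

    def __init__(self) -> None:
        super().__init__()
        self.links: list[str] = []

    def handle_starttag(self, tag: str, attrs: list) -> None:
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def extract_links(html: str) -> list[str]:
    """Return every href found in `html`, in document order."""
    parser = LinkExtractor()
    parser.feed(html)
    return parser.links
```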
## Contributing

- Create a new branch for your feature or bugfix
- Implement your changes with appropriate tests
- Ensure all tests pass and coverage doesn't decrease
- Submit a pull request with a clear description of the changes
## Code Style

- Follow PEP 8 style guidelines
- Include docstrings for all functions, classes, and modules
- Add type hints to function signatures
- Keep functions focused on a single responsibility
- Write tests for all new functionality
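Taken together, the guidelines above look like this in practice. The function is illustrative only, written for this guide rather than taken from the project:

```python
# Illustrative only: a function written to the style guidelines above
# (PEP 8 naming, a full docstring, type hints, single responsibility).
def count_internal_links(links: list[str], host: str) -> int:
    """Count links that point at `host` or are host-relative.

    Args:
        links: Candidate link URLs, absolute or relative.
        host: The host treated as "internal", e.g. "localhost:8080".

    Returns:
        The number of internal links.
    """
    internal = 0
    for link in links:
        if link.startswith("/") or f"//{host}" in link:
            internal += 1
    return internal
```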