From 9725f8545f8d78366db1ce96444386412eac1546 Mon Sep 17 00:00:00 2001 From: Max Chis Date: Tue, 10 Feb 2026 17:59:12 -0500 Subject: [PATCH] Add docs/ folder, CLAUDE.md, and streamline README Reorganize documentation into dedicated guides (architecture, API reference, development, testing, deployment, collectors) so cross-cutting topics are discoverable without duplicating the existing per-module READMEs. Slim the root README down to a quick-start and doc index. Add CLAUDE.md for AI-assisted development context. Add .vscode/ to .gitignore. Co-Authored-By: Claude Opus 4.6 --- .gitignore | 1 + CLAUDE.md | 119 +++++++++++++++++++++ README.md | 244 ++++++++----------------------------------- docs/api.md | 233 +++++++++++++++++++++++++++++++++++++++++ docs/architecture.md | 236 +++++++++++++++++++++++++++++++++++++++++ docs/collectors.md | 145 +++++++++++++++++++++++++ docs/deployment.md | 126 ++++++++++++++++++++++ docs/development.md | 182 ++++++++++++++++++++++++++++++++ docs/testing.md | 120 +++++++++++++++++++++ 9 files changed, 1203 insertions(+), 203 deletions(-) create mode 100644 CLAUDE.md create mode 100644 docs/api.md create mode 100644 docs/architecture.md create mode 100644 docs/collectors.md create mode 100644 docs/deployment.md create mode 100644 docs/development.md create mode 100644 docs/testing.md diff --git a/.gitignore b/.gitignore index ae2ff19a..ffc47b9c 100644 --- a/.gitignore +++ b/.gitignore @@ -1,5 +1,6 @@ .DS_Store .idea/ +.vscode/ __pycache__/ .env venv/ diff --git a/CLAUDE.md b/CLAUDE.md new file mode 100644 index 00000000..23f1547b --- /dev/null +++ b/CLAUDE.md @@ -0,0 +1,119 @@ +# CLAUDE.md + +This file provides context for Claude Code when working in this repository. + +## Project Overview + +This is the **Source Manager** — a FastAPI application for the Police Data Accessibility Project (PDAP). It collects, enriches, and manages URLs that point to police data sources, then synchronizes validated results to the Data Sources App. 
+ +## Tech Stack + +- **Python 3.11+** with **uv** for package management +- **FastAPI** + **uvicorn** for the web framework +- **SQLAlchemy** (async) + **asyncpg** for database access +- **PostgreSQL 15** as the database +- **Alembic** for database migrations +- **Pydantic v2** for data validation +- **Docker** for local database and testing +- **pytest** + **pytest-asyncio** for testing + +## Key Commands + +```bash +# Install dependencies +uv sync + +# Start local database +cd local_database && docker compose up -d && cd .. + +# Run the app locally +fastapi dev main.py + +# Run automated tests (requires local database running) +uv run pytest tests/automated + +# Run alembic migration tests +uv run pytest tests/alembic + +# Generate a new migration +alembic revision --autogenerate -m "Description" + +# Apply migrations manually +python apply_migrations.py +``` + +## Project Structure + +All application code lives under `src/`: + +- `src/api/` — FastAPI routers and endpoints (15 route groups, 65 endpoints) +- `src/core/` — Integration layer: `AsyncCore`, task system, logger, env var manager +- `src/db/` — Database layer: async client, SQLAlchemy models, queries, DTOs +- `src/collectors/` — Pluggable URL collectors (Common Crawler, Auto-Googler, CKAN, MuckRock) +- `src/external/` — External service clients (HuggingFace, PDAP API, Internet Archive) +- `src/security/` — JWT auth via tokens from the Data Sources App +- `src/util/` — Shared helper functions + +## Architecture Patterns + +### API Endpoint Convention + +Each endpoint group follows this layout: + +``` +src/api/endpoints// +├── routes.py # APIRouter with all routes +├── get/ post/ put/ delete/ +│ ├── __init__.py # Endpoint handler +│ ├── query.py # Database query logic +│ └── dto.py # Request/response Pydantic models +└── _shared/ # Shared logic across methods +``` + +### Dependency Injection + +The app uses FastAPI's `app.state` to share core dependencies: +- `app.state.async_core` — `AsyncCore` instance 
(main facade) +- `app.state.async_scheduled_task_manager` — scheduled task manager +- `app.state.logger` — `AsyncCoreLogger` instance + +### Collector Pattern + +All collectors inherit from `AsyncCollectorBase` and are registered in `src/collectors/mapping.py`. Each must implement `run_implementation()` and specify a `preprocessor` class. + +### Task System + +- **URL tasks** enrich individual URLs (HTML scraping, agency ID, record type classification, etc.). Operators live in `src/core/tasks/url/operators/`. +- **Scheduled tasks** handle system-wide operations (sync to DS App, cleanup, HuggingFace upload, etc.). Implementations live in `src/core/tasks/scheduled/impl/`. + +## Testing + +- **Automated tests** (`tests/automated/`) — run in CI, no third-party API calls. +- **Alembic tests** (`tests/alembic/`) — validate migration scripts. +- **Manual tests** (`tests/manual/`) — involve third-party APIs, run individually. Directory lacks `test` prefix intentionally. +- Async mode is `auto` — async test functions are detected automatically. +- Test timeout is 300 seconds. +- Fixtures in `tests/conftest.py` provide `adb_client` and `db_client`. +- Test data helpers are in `tests/helpers/data_creator/`. + +## Environment Variables + +See `ENV.md` for the full reference. Key categories: +- `POSTGRES_*` — database connection +- `DS_APP_SECRET_KEY` — JWT validation (must match the Data Sources App) +- Various API keys (Google, HuggingFace, PDAP, Discord, etc.) +- Feature flags — all tasks can be individually toggled (set to `0` to disable) + +## Database + +- Managed with Alembic. Migrations live in `alembic/versions/`. +- Models are in `src/db/models/impl/` organized by entity. +- The primary interface is `AsyncDatabaseClient` in `src/db/client/async_.py`. +- Local database uses Docker: `local_database/docker-compose.yml`. + +## Important Notes + +- The app exposes its API docs at `/api` (not the default `/docs` — `/docs` redirects to `/api`). 
+- CORS is configured for `localhost:8888`, `pdap.io`, and `pdap.dev`. +- Two permission levels: `access_source_collector` (general) and `source_collector_final_review` (final review). +- The app synchronizes agencies, data sources, and meta URLs to the Data Sources App via nine scheduled sync tasks. diff --git a/README.md b/README.md index 56e8182d..2473743f 100644 --- a/README.md +++ b/README.md @@ -1,226 +1,64 @@ -This is a multi-language repo containing scripts or tools for identifying and cataloguing Data Sources based on their URL and HTML content. - -# Index - -name | description of purpose ---- | --- -.github/workflows | Scheduling and automation -agency_identifier | Matches URLs with an agency from the PDAP database -annotation_pipeline | Automated pipeline for generating training data in our ML data source identification models. Manages common crawl, HTML tag collection, and Label Studio import/export -html_tag_collector | Collects HTML header, meta, and title tags and appends them to a JSON file. The idea is to make a richer dataset for algorithm training and data labeling. -identification_pipeline.py | The core python script uniting this modular pipeline. More details below. 
-llm_api_logic | Scripts for accessing the openai API on PDAP's shared account -source_collectors| Tools for extracting metadata from different sources, including CKAN data portals and Common Crawler -collector_db | Database for storing data from source collectors -collector_manager | A module which provides a unified interface for interacting with source collectors and relevant data -core | A module which integrates other components, such as collector_manager and collector_db -api | API for interacting with collector_manager, core, and collector_db -local_database | Resources for setting up a test database for local development -security_manager| A module which provides a unified interface for interacting with authentication and authorization | -tests | Unit and integration tests | -util | various utility functions | - -## Installation +# Data Source Manager -``` -uv sync -``` - -## How to use - -1. Create an .env file in this directory following the instructions in `ENV.md` - 1. If necessary, start up the database using `docker compose up -d` while in the `local_database` directory -2. Run `fastapi dev main.py` to start up the fast API server -3. In a browser, navigate to `http://localhost:8000/docs` to see the full list of API endpoints - -Note that to access API endpoints, you will need to have a valid Bearer Token from the Data Sources API at `https://data-sources.pdap.io/api` - -# Contributing - -Thank you for your interest in contributing to this project! Please follow these guidelines: - -- [These Design Principles](https://github.com/Police-Data-Accessibility-Project/meta/blob/main/DESIGN-PRINCIPLES.md) may be used to make decisions or guide your work. -- If you want to work on something, create an issue first so the broader community can discuss it. -- If you make a utility, script, app, or other useful bit of code: put it in a top-level directory with an appropriate name and dedicated README and add it to the index. 
- -# Testing - -Note that prior to running tests, you need to install [Docker](https://docs.docker.com/get-started/get-docker/) and have the Docker engine running. +A FastAPI application for identifying and cataloguing police data sources. Part of the [Police Data Accessibility Project](https://pdap.io) (PDAP). -Tests can be run by spinning up the `docker-compose-test.yml` file in the root directory. This will start a two-container setup, consisting of the FastAPI Web App and a clean Postgres Database. +The Source Manager collects URLs from various sources, enriches them with metadata using automated tasks and ML models, supports human annotation for validation, and synchronizes approved data sources to the [Data Sources App](https://data-sources.pdap.io/api). -This can be done via the following command: +## Quick Start ```bash -docker compose up -d -``` - -Following that, you will need to set up the uvicorn server using the following command: - -```bash -docker exec data-source-identification-app-1 uvicorn api.main:app --host 0.0.0.0 --port 80 -``` - -Note that while the container may mention the web app running on `0.0.0.0:8000`, the actual host may be `127.0.0.1:8000`. - -To access the API documentation, visit `http://{host}:8000/docs`. +# Install dependencies +uv sync -To run tests on the container, run: +# Start the local database +cd local_database && docker compose up -d && cd .. -```bash -docker exec data-source-identification-app-1 pytest /app/tests/automated -``` +# Create a .env file (see ENV.md for all variables) +# At minimum, set the POSTGRES_* variables to match local_database defaults. -Be sure to inspect the `docker-compose.yml` file in the root directory -- some environment variables are dependant upon the Operating System you are using. 
- -# Diagrams - -## Identification pipeline plan - -```mermaid -flowchart TD - SourceCollectors["**Source Collectors:** scripts for creating batches of potentially useful URLs using different strategies"] - Identifier["Batches are prepared for labeling by automatically collecting metadata and identifying low-hanging fruit properties"] - SourceCollectorLabeling["Human labeling of missing or uncertain metadata takes place in Source Collector Retool app"] - SourceCollectorReview["Human Final Review of the labeled sources, for submission or discard, in Retool"] - API["Submitting sources to the Data Sources API when they are Relevant and have an **Agency, Record Type, and Name**"] - - SourceCollectors --> Identifier - Identifier --> SourceCollectorLabeling - SourceCollectorLabeling --> SourceCollectorReview - SourceCollectorReview --> API - API --> Search["Allowing users to search for data and browse maps"] - Search --> Sentiment["Capturing user sentiment and overall database utility"] - API --> MLModels["Improving ML metadata labelers: relevance, agency, record type, etc"] - API --> Missingness["Documenting data we have searched for and found to be missing"] - Missingness --> Maps["Mapping our progress and the overall state of data access"] - - %% Default class for black stroke - classDef default fill:#fffbfa,stroke:#000,stroke-width:1px,color:#000; - - %% Custom styles - class API gold; - class Search lightgold; - class MLModels,Missingness lightergold; - - %% Define specific classes - classDef gray fill:#bfc0c0 - classDef gold fill:#d5a23c - classDef lightgold fill:#fbd597 - classDef lightergold fill:#fdf0dd - classDef byzantium fill:#dfd6de +# Run the app +fastapi dev main.py ``` -## Training models by batching and annotating URLs - -```mermaid -%% Here's a guide to mermaid syntax: https://mermaid.js.org/syntax/flowchart.html +Then open `http://localhost:8000/api` for the interactive API docs. 
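The `.env` step above can be sketched for the database settings only. Variable names follow the `POSTGRES_*` convention mentioned in `ENV.md`; confirm the exact names and the defaults used by `local_database/docker-compose.yml` before copying — every value below is an illustrative placeholder:

```ini
# Illustrative .env sketch — check ENV.md for the authoritative variable list
POSTGRES_HOST=localhost
POSTGRES_PORT=5432
POSTGRES_USER=postgres
POSTGRES_PASSWORD=postgres
POSTGRES_DB=source_manager
```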
-sequenceDiagram +Note: accessing API endpoints requires a valid Bearer token from the Data Sources API. -participant HF as Hugging Face -participant GH as GitHub -participant SC as Source Collector app -participant PDAP as PDAP API +## Documentation -loop create batches of URLs
for human labeling - SC ->> SC: Crawl for a new batch
of URLs with common_crawler
or other methods - SC ->> SC: Add metadata to each batch
with source_tag_collector - SC ->> SC: Add labeling tasks in
the Source Collector app +| Document | Description | +|----------|-------------| +| [Architecture](docs/architecture.md) | System design, module structure, task system, data flow | +| [API Reference](docs/api.md) | All 65 endpoints across 15 route groups | +| [Development Guide](docs/development.md) | Local setup, environment variables, common workflows | +| [Testing Guide](docs/testing.md) | Running tests, CI pipeline, writing new tests | +| [Deployment](docs/deployment.md) | Docker, Alembic migrations, DS App synchronization | +| [Collectors](docs/collectors.md) | Collector architecture and how to build new ones | +| [Environment Variables](ENV.md) | Full reference for all env vars and feature flags | -loop annotate URLs - SC ->> SC: Users label using
Retool interface - SC ->> SC: Reviewers finalize
and submit labels -end +## Project Structure -loop update training data
with new annotations - SC ->> SC: Check for completed
annotation tasks - SC -->> PDAP: Submit labeled URLs to the app - SC ->> HF: Write all annotations to
training-urls dataset - SC ->> SC: maintain batch status -end - -loop model training - HF ->> HF: retrain ML models with
updated data using
trainer in hugging_face -end - -end ``` - -# Docstring and Type Checking - -Docstrings and Type Checking are checked using the [pydocstyle](https://www.pydocstyle.org/en/stable/) and [mypy](https://mypy-lang.org/) -modules, respectively. When making a pull request, a Github Action (`python_checks.yml`) will run and, -if it detects any missing docstrings or type hints in files that you have modified, post them in the Pull Request. - -These will *not* block any Pull request, but exist primarily as advisory comments to encourage good coding standards. - -Note that `python_checks.yml` will only function on pull requests made from within the repo, not from a forked repo. - -# Syncing to Data Sources App - -The Source Manager (SM) is part of a two app system, with the other app being the Data Sources (DS) App. - - -## Add, Update, and Delete - -These are the core synchronization actions. - -In order to propagate changes to DS, we synchronize additions, updates, and deletions of the following entities: -- Agencies -- Data Sources -- Meta URLs - -Each action for each entity occurs through a separate task. At the moment, there are nine tasks total. - -Each task gathers requisite information from the SM database and sends a request to one of nine corresponding endpoints in the DS API. - -Each DS endpoint follows the following format: - -```text -/v3/sync/{entity}/{action} +src/ +├── api/ # FastAPI routers and endpoint logic +├── core/ # Integration layer and task system +├── db/ # SQLAlchemy models, async DB client, queries +├── collectors/ # Pluggable URL collection strategies +├── external/ # Clients for external services (HuggingFace, PDAP, etc.) +├── security/ # JWT auth and permissions +└── util/ # Shared helpers ``` -Synchronizations are designed to occur on an hourly basis. - -Here is a high-level description of how each action works: +## Contributing -### Add - -Adds the given entities to DS. - -These are denoted with the `/{entity}/add` path in the DS API. 
- -When an entity is added, it returns a unique DS ID that is mapped to the internal SM database ID via the DS app link tables. - -For an entity to be added, it must meet preconditions which are distinct for each entity: -- Agencies: Must have an agency entry in the database and be linked to a location. -- Data Sources: Must be a URL that has been internally validated as a data source and linked to an agency. -- Meta URLs: Must be a URL that has been internally validated as a meta URL and linked to an agency. - -### Update - -Updates the given entities in DS. - -These are denoted with the `/{entity}/update` path in the DS API. - -These consist of submitting the updated entities (in full) to the requisite endpoint, and updating the local app link to indicate that the update occurred. All updates are designed to be full overwrites of the entity. - -For an entity to be updated, it must meet preconditions which are distinct for each entity: -- Agencies: Must have either an agency row updated or an agency/location link updated or deleted. -- Data Sources: One of the following must be updated: - - The URL table - - The record type table - - The optional data sources metadata table - - The agency link table (either an addition or deletion) -- Meta URLs: Must be a URL that has been internally validated as a meta URL and linked to an agency. Either the URL table or the agency link table (addition or deletion) must be updated. - -### Delete +Thank you for your interest in contributing to this project! Please follow these guidelines: -Deletes the given entities from DS. +- [These Design Principles](https://github.com/Police-Data-Accessibility-Project/meta/blob/main/DESIGN-PRINCIPLES.md) may be used to make decisions or guide your work. +- If you want to work on something, create an issue first so the broader community can discuss it. 
+- If you make a utility, script, app, or other useful bit of code: put it in a top-level directory with an appropriate name and dedicated README and add it to the index. -These are denoted with the `/{entity}/delete` path in the DS API. +## Code Quality -This consists of submitting a set of DS IDs to the requisite endpoint, and removing the associated DS app link entry in the SM database. +Docstrings and type hints are checked via a GitHub Action (`python_checks.yml`) using [pydocstyle](https://www.pydocstyle.org/en/stable/) and [mypy](https://mypy-lang.org/). These produce advisory PR comments and do *not* block merges. -When an entity with a corresponding DS App Link is deleted from the Source Manager, the core data is removed but a deletion flag is appended to the DS App Link entry, indicating that the entry is not yet removed from the DS App. The deletion task uses this flag to identify entities to be deleted, submits the deletion request to the DS API, and removes both the flag and the DS App Link. \ No newline at end of file +Note: `python_checks.yml` only runs on pull requests from within the repo, not from forks. diff --git a/docs/api.md b/docs/api.md new file mode 100644 index 00000000..18054221 --- /dev/null +++ b/docs/api.md @@ -0,0 +1,233 @@ +# API Reference + +The Source Manager exposes a FastAPI application with 15 route groups and 65 endpoints. The interactive API docs are available at `/api` when the server is running. + +## Authentication + +Most endpoints require a valid Bearer token from the [Data Sources App](https://data-sources.pdap.io/api). Tokens are JWTs validated against the `DS_APP_SECRET_KEY` environment variable. 
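Conceptually, validating such a token means checking its HS256 signature against the shared secret and then reading the permission claims. A stdlib-only sketch of that check (the app itself uses a proper JWT library; the `permissions` claim name and both helper names below are illustrative assumptions, not the actual implementation):

```python
import base64
import hashlib
import hmac
import json


def _b64url(data: bytes) -> str:
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode()


def _b64url_decode(text: str) -> bytes:
    return base64.urlsafe_b64decode(text + "=" * (-len(text) % 4))


def make_token(claims: dict, secret: str) -> str:
    """Build an HS256 JWT (illustrative; real tokens come from the DS App)."""
    header = _b64url(json.dumps({"alg": "HS256", "typ": "JWT"}).encode())
    payload = _b64url(json.dumps(claims).encode())
    signing_input = f"{header}.{payload}".encode()
    sig = _b64url(hmac.new(secret.encode(), signing_input, hashlib.sha256).digest())
    return f"{header}.{payload}.{sig}"


def verify_token(token: str, secret: str) -> dict:
    """Check the signature and return the claims, or raise ValueError."""
    header, payload, sig = token.split(".")
    signing_input = f"{header}.{payload}".encode()
    expected = _b64url(hmac.new(secret.encode(), signing_input, hashlib.sha256).digest())
    if not hmac.compare_digest(sig, expected):
        raise ValueError("invalid signature")
    return json.loads(_b64url_decode(payload))


# Example: a token carrying the general-access permission
token = make_token({"sub": "user-1", "permissions": ["access_source_collector"]}, "secret")
claims = verify_token(token, "secret")
print("access_source_collector" in claims["permissions"])  # True
```

A real JWT library additionally checks registered claims such as `exp`; this sketch only shows the signature-plus-claims idea.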
+ +Two permission levels are used: + +| Permission | Description | +|------------|-------------| +| `access_source_collector` | General access (required for most endpoints) | +| `source_collector_final_review` | Final review of annotations | + +Include the token in the `Authorization` header: + +``` +Authorization: Bearer +``` + +## Endpoints + +### Root (`/`) + +| Method | Path | Description | +|--------|------|-------------| +| GET | `/` | Health check — returns "Hello World" | + +--- + +### Agencies (`/agencies`) + +Manage agencies and their location associations. + +| Method | Path | Description | +|--------|------|-------------| +| GET | `/agencies` | List agencies (paginated) | +| POST | `/agencies` | Create a new agency | +| DELETE | `/agencies/{agency_id}` | Delete an agency | +| PUT | `/agencies/{agency_id}` | Update an agency | +| GET | `/agencies/{agency_id}/locations` | List locations for an agency | +| POST | `/agencies/{agency_id}/locations/{location_id}` | Link a location to an agency | +| DELETE | `/agencies/{agency_id}/locations/{location_id}` | Unlink a location from an agency | + +--- + +### Annotate (`/annotate`) + +Annotation workflows for labeling URLs — both anonymous and authenticated. + +| Method | Path | Description | +|--------|------|-------------| +| GET | `/annotate/anonymous` | Get next URL for anonymous annotation | +| POST | `/annotate/anonymous/{url_id}` | Submit anonymous annotation, get next URL | +| GET | `/annotate/all` | Get next URL for authenticated annotation (optional `batch_id`, `url_id` filters) | +| POST | `/annotate/all/{url_id}` | Submit authenticated annotation, get next URL | +| GET | `/annotate/suggestions/agencies/{url_id}` | Get agency suggestions for a URL | + +--- + +### Batch (`/batch`) + +View and manage URL collection batches. 
+ +| Method | Path | Description | +|--------|------|-------------| +| GET | `/batch` | List batch summaries (filterable by collector type, status) | +| GET | `/batch/{batch_id}` | Get batch details | +| GET | `/batch/{batch_id}/urls` | List URLs in a batch (paginated) | +| GET | `/batch/{batch_id}/duplicates` | List duplicate URLs in a batch (paginated) | +| GET | `/batch/{batch_id}/logs` | Get logs for a batch | +| POST | `/batch/{batch_id}/abort` | Abort a running batch | + +--- + +### Check (`/check`) + +Validation endpoints. + +| Method | Path | Description | +|--------|------|-------------| +| GET | `/check/unique-url` | Check if a URL is unique in the database | + +--- + +### Collector (`/collector`) + +Start URL collection runs. Each collector type has its own endpoint. + +| Method | Path | Description | +|--------|------|-------------| +| POST | `/collector/example` | Start the example collector | +| POST | `/collector/ckan` | Start the CKAN collector | +| POST | `/collector/common-crawler` | Start the Common Crawler collector | +| POST | `/collector/auto-googler` | Start the Auto Googler collector | +| POST | `/collector/muckrock-simple` | Start MuckRock simple search | +| POST | `/collector/muckrock-county` | Start MuckRock county-level search | +| POST | `/collector/muckrock-all` | Start MuckRock all FOIA requests | +| POST | `/collector/manual` | Upload a manual batch of URLs | + +--- + +### Contributions (`/contributions`) + +Track user annotation contributions. + +| Method | Path | Description | +|--------|------|-------------| +| GET | `/contributions/leaderboard` | Get contribution leaderboard | +| GET | `/contributions/user` | Get current user's contributions and agreement rates | + +--- + +### Data Sources (`/data-sources`) + +Manage validated data sources and their agency associations. 
+ +| Method | Path | Description | +|--------|------|-------------| +| GET | `/data-sources` | List data sources (paginated) | +| GET | `/data-sources/{url_id}` | Get a data source by URL ID | +| PUT | `/data-sources/{url_id}` | Update a data source | +| GET | `/data-sources/{url_id}/agencies` | List agencies for a data source | +| POST | `/data-sources/{url_id}/agencies/{agency_id}` | Link an agency to a data source | +| DELETE | `/data-sources/{url_id}/agencies/{agency_id}` | Unlink an agency from a data source | + +--- + +### Locations (`/locations`) + +| Method | Path | Description | +|--------|------|-------------| +| POST | `/locations` | Create a new location | + +--- + +### Meta URLs (`/meta-urls`) + +Manage meta URLs (non-data-source URLs associated with agencies) and their agency associations. + +| Method | Path | Description | +|--------|------|-------------| +| GET | `/meta-urls` | List meta URLs (paginated) | +| PUT | `/meta-urls/{url_id}` | Update a meta URL | +| GET | `/meta-urls/{url_id}/agencies` | List agencies for a meta URL | +| POST | `/meta-urls/{url_id}/agencies/{agency_id}` | Link an agency to a meta URL | +| DELETE | `/meta-urls/{url_id}/agencies/{agency_id}` | Unlink an agency from a meta URL | + +--- + +### Metrics (`/metrics`) + +Analytics and progress tracking. 
+ +| Method | Path | Description | +|--------|------|-------------| +| GET | `/metrics/batches/aggregated` | Aggregated batch metrics | +| GET | `/metrics/batches/breakdown` | Per-batch metrics breakdown (paginated) | +| GET | `/metrics/urls/aggregate` | Aggregated URL metrics | +| GET | `/metrics/urls/aggregate/pending` | Aggregated pending URL metrics | +| GET | `/metrics/urls/breakdown/submitted` | Submitted URLs breakdown | +| GET | `/metrics/urls/breakdown/pending` | Pending URLs breakdown | +| GET | `/metrics/backlog` | Annotation backlog metrics | + +--- + +### Search (`/search`) + +| Method | Path | Description | +|--------|------|-------------| +| GET | `/search/url` | Search for a URL | +| GET | `/search/agency` | Search for agencies (requires at least one of: `query`, `location_id`, `jurisdiction_type`) | + +--- + +### Submit (`/submit`) + +Submit new URLs and data sources for review. + +| Method | Path | Description | +|--------|------|-------------| +| POST | `/submit/url` | Submit a URL for review | +| POST | `/submit/data-source` | Submit a data source proposal (returns 409 if duplicate) | + +--- + +### Task (`/task`) + +View task status and history. + +| Method | Path | Description | +|--------|------|-------------| +| GET | `/task` | List tasks (filterable by status and type, paginated) | +| GET | `/task/status` | Get current task processing status | +| GET | `/task/{task_id}` | Get details for a specific task | + +--- + +### URL (`/url`) + +View and manage individual URLs. 
+ +| Method | Path | Description | +|--------|------|-------------| +| GET | `/url` | List URLs (paginated, filterable to show errors only) | +| GET | `/url/{url_id}/screenshot` | Get screenshot for a URL (returns WebP image) | +| DELETE | `/url/{url_id}` | Delete a URL | + +## Endpoint Structure + +Each endpoint group follows a consistent directory layout: + +``` +src/api/endpoints// +├── routes.py # APIRouter definition with all routes +├── get/ # GET endpoint(s) +│ ├── __init__.py # Handler function +│ ├── query.py # Database query logic +│ └── dto.py / response.py / request.py +├── post/ # POST endpoint(s) +├── put/ # PUT endpoint(s) +├── delete/ # DELETE endpoint(s) +└── _shared/ # Shared logic across HTTP methods +``` + +## CORS + +Allowed origins: + +- `http://localhost:8888` (local development) +- `https://pdap.io` +- `https://pdap.dev` diff --git a/docs/architecture.md b/docs/architecture.md new file mode 100644 index 00000000..d96de92d --- /dev/null +++ b/docs/architecture.md @@ -0,0 +1,236 @@ +# Architecture + +The Source Manager (SM) is a FastAPI application that identifies and catalogues police data sources. It is one half of a two-app system — the other being the [Data Sources App](https://data-sources.pdap.io/api) (DS). + +## High-Level Overview + +``` +Collectors ──> Database ──> Tasks (automated enrichment) ──> Annotation (human review) ──> Sync to DS App +``` + +1. **Collectors** gather batches of URLs from external sources (Google, CKAN portals, MuckRock, Common Crawl). +2. URLs are stored in the **database** along with batch metadata. +3. **URL tasks** automatically enrich each URL — scraping HTML, identifying agencies, classifying record types, taking screenshots, etc. +4. **Annotation** workflows let humans label and validate the automated results. +5. **Scheduled tasks** synchronize validated data to the Data Sources App and perform housekeeping. 
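Step 1's pluggable collectors follow a simple contract — subclass a shared async base, implement one method, and register the class in a mapping (per `src/collectors/mapping.py`). A minimal sketch of that shape (the class bodies and example URL are illustrative; the real base class also handles batch bookkeeping and a preprocessor):

```python
import asyncio


class AsyncCollectorBase:
    """Illustrative skeleton of the collector contract — not the real base class."""

    async def run_implementation(self) -> list[str]:
        raise NotImplementedError

    async def run(self) -> list[str]:
        # The real run() also records batch metadata and invokes a preprocessor.
        return await self.run_implementation()


class ExampleCollector(AsyncCollectorBase):
    async def run_implementation(self) -> list[str]:
        return ["https://example.com/police-data"]


# Registry mapping collector names to classes, in the spirit of mapping.py
COLLECTOR_MAPPING = {"example": ExampleCollector}

urls = asyncio.run(COLLECTOR_MAPPING["example"]().run())
print(urls)
```

Adding a new collector is then just a new subclass plus one registry entry; see [collectors.md](collectors.md) for the real requirements.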
+ +## Module Structure + +``` +src/ +├── api/ # FastAPI routers and endpoint logic +├── core/ # Integration layer — ties everything together +├── db/ # SQLAlchemy models, async database client, queries, DTOs +├── collectors/ # URL collection strategies (pluggable) +├── external/ # Clients for external services (HuggingFace, PDAP, Internet Archive) +├── security/ # JWT validation and permission checks +└── util/ # Shared helpers and utilities +``` + +### `api/` + +The FastAPI application with 15 routers covering 65 endpoints. Each endpoint group follows a consistent structure: + +``` +endpoints// +├── routes.py # Router definition +├── get/ post/ put/ delete/ +│ ├── __init__.py # Endpoint handler +│ ├── query.py # Database query logic +│ └── dto.py # Request/response models +└── _shared/ # Logic shared across methods +``` + +See [api.md](api.md) for the full endpoint reference. + +### `core/` + +The integration layer. Key components: + +- **`AsyncCore`** — central facade that coordinates the collector manager, task manager, and database client. Injected into API endpoints via `app.state`. +- **`AsyncCoreLogger`** — logs operations to the database. +- **`EnvVarManager`** — centralized access to environment variables. +- **Task system** — see [Task System](#task-system) below. + +### `db/` + +The database layer, built on SQLAlchemy with async support: + +- **`client/async_.py`** — `AsyncDatabaseClient`, the primary interface for all database operations. +- **`client/sync.py`** — `DatabaseClient`, used for synchronous initialization (e.g. `init_db`). +- **`models/`** — SQLAlchemy ORM models organized by entity (agency, batch, url, location, etc.), including materialized views and database views. +- **`queries/`** — Complex query builder classes. +- **`dtos/`** — Data transfer objects for passing data between layers. + +### `collectors/` + +A pluggable system for gathering URLs from different sources. 
Each collector extends `AsyncCollectorBase`, is registered in `COLLECTOR_MAPPING`, and runs asynchronously. See [collectors.md](collectors.md) for details. + +### `external/` + +Clients for services outside this repository: + +| Client | Purpose | +|--------|---------| +| `huggingface/inference/` | HuggingFace Inference API for ML classification | +| `huggingface/hub/` | HuggingFace Hub for uploading training datasets | +| `pdap/` | PDAP Data Sources API (using `pdap-access-manager`) | +| `internet_archives/` | Internet Archive S3 API for URL preservation | +| `url_request/` | Generic HTTP request interface for scraping | + +### `security/` + +JWT-based authentication using tokens from the Data Sources App. The `SecurityManager` validates tokens against `DS_APP_SECRET_KEY` and extracts user permissions. Two permission levels are relevant: + +- `access_source_collector` — general access to the Source Manager +- `source_collector_final_review` — permission for final review of annotations + +## Task System + +Tasks are the primary mechanism for automated data enrichment. + +### Nomenclature + +| Term | Definition | +|------|------------| +| **Collector** | A submodule for collecting URLs from a particular source | +| **Batch** | A group of URLs produced by a single collector run | +| **Cycle** | The full lifecycle of a URL — from initial retrieval to disposal or submission to DS | +| **Task** | A semi-independent operation performed on a set of URLs | +| **Task Set** | A group of URLs operated on together as part of a single task | +| **Task Operator** | A class that performs a single task on a set of URLs | +| **Subtask** | A subcomponent of a Task Operator, often distinguished by collector strategy | + +### URL Tasks + +URL tasks run against individual URLs to enrich them with metadata. They are triggered by the `RUN_URL_TASKS` scheduled task and managed by the `TaskManager`. 
Each task is implemented as a **Task Operator**: + +| Operator | Purpose | +|----------|---------| +| `html` | Scrapes HTML content from the URL | +| `probe` | Probes the URL for web metadata (status codes, redirects) | +| `root_url` | Extracts and links the root URL | +| `agency_identification` | Matches URLs to agencies using multiple subtask strategies | +| `location_id` | Identifies the geographic location associated with a URL | +| `record_type` | Classifies the record type using ML models | +| `auto_relevant` | Classifies whether the URL is relevant to police data | +| `auto_name` | Generates a human-readable name for the URL | +| `misc_metadata` | Extracts miscellaneous metadata | +| `screenshot` | Captures a screenshot of the URL | +| `validate` | Automatically validates URLs meeting certain criteria | +| `suspend` | Suspends URLs that meet suspension criteria | + +### Scheduled Tasks + +Scheduled tasks run on a recurring basis (typically hourly) and handle system-wide operations: + +| Task | Purpose | +|------|---------| +| Run URL Tasks | Triggers URL task processing | +| Sync to DS (9 tasks) | Synchronizes agencies, data sources, and meta URLs to the Data Sources App (add/update/delete for each) | +| Push to HuggingFace | Uploads annotation data to HuggingFace for model training | +| Internet Archive Probe/Save | Probes and saves URLs to the Internet Archive | +| Populate Backlog Snapshot | Generates a snapshot of the annotation backlog | +| Refresh Materialized Views | Refreshes database materialized views | +| Update URL Status | Updates the processing status of URLs | +| Delete Old Logs | Cleans up old log entries | +| Delete Stale Screenshots | Removes screenshots for already-validated URLs | +| Task Cleanup | Cleans up completed or stale task records | +| Integrity Monitor | Runs integrity checks on the data | + +All tasks can be individually enabled or disabled via environment variable flags. See [ENV.md](../ENV.md) for the full list. 
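The flag mechanism can be sketched as a small helper. The flag names below follow the convention documented in ENV.md, but the helper itself is illustrative, not the repository's actual implementation:

```python
import os

def task_enabled(flag_name: str, default: bool = True) -> bool:
    """Treat an unset flag as enabled; '0'/'false'/'no' disable the task."""
    raw = os.environ.get(flag_name)
    if raw is None:
        return default
    return raw.strip().lower() not in {"0", "false", "no"}

# With REFRESH_MATERIALIZED_VIEWS_TASK_FLAG=0 set, the materialized-view
# refresh task would be skipped on its next scheduled run:
os.environ["REFRESH_MATERIALIZED_VIEWS_TASK_FLAG"] = "0"
print(task_enabled("REFRESH_MATERIALIZED_VIEWS_TASK_FLAG"))  # False
```

Defaulting to enabled matches the documented behavior: a task runs unless its flag is explicitly turned off.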
+ +## Data Sources App Synchronization + +The SM synchronizes three entity types to the DS App: + +- **Agencies** +- **Data Sources** +- **Meta URLs** + +Each entity has three sync operations (add, update, delete), for nine tasks total. Each task: + +1. Queries the SM database for entities that need syncing. +2. Sends a request to the corresponding DS API endpoint (`/v3/sync/{entity}/{action}`). +3. Updates local tracking records (DS App Link tables) to reflect the sync. + +Synchronization runs hourly. For full details on preconditions and behavior, see the [Syncing to Data Sources App](../README.md#syncing-to-data-sources-app) section of the root README. + +## Identification Pipeline + +```mermaid +flowchart TD + SourceCollectors["**Source Collectors:** scripts for creating batches of potentially useful URLs using different strategies"] + Identifier["Batches are prepared for labeling by automatically collecting metadata and identifying low-hanging fruit properties"] + SourceCollectorLabeling["Human labeling of missing or uncertain metadata takes place in Source Collector Retool app"] + SourceCollectorReview["Human Final Review of the labeled sources, for submission or discard, in Retool"] + API["Submitting sources to the Data Sources API when they are Relevant and have an **Agency, Record Type, and Name**"] + + SourceCollectors --> Identifier + Identifier --> SourceCollectorLabeling + SourceCollectorLabeling --> SourceCollectorReview + SourceCollectorReview --> API + API --> Search["Allowing users to search for data and browse maps"] + Search --> Sentiment["Capturing user sentiment and overall database utility"] + API --> MLModels["Improving ML metadata labelers: relevance, agency, record type, etc"] + API --> Missingness["Documenting data we have searched for and found to be missing"] + Missingness --> Maps["Mapping our progress and the overall state of data access"] + + classDef default fill:#fffbfa,stroke:#000,stroke-width:1px,color:#000; + class API gold; + 
class Search lightgold;
+    class MLModels,Missingness lightergold;
+    classDef gray fill:#bfc0c0
+    classDef gold fill:#d5a23c
+    classDef lightgold fill:#fbd597
+    classDef lightergold fill:#fdf0dd
+    classDef byzantium fill:#dfd6de
+```
+
+## Training Pipeline
+
+```mermaid
+sequenceDiagram
+
+participant HF as Hugging Face
+participant GH as GitHub
+participant SC as Source Collector app
+participant PDAP as PDAP API
+
+loop create batches of URLs for human labeling
+    SC ->> SC: Crawl for a new batch of URLs with common_crawler or other methods
+    SC ->> SC: Add metadata to each batch with source_tag_collector
+    SC ->> SC: Add labeling tasks in the Source Collector app
+
+loop annotate URLs
+    SC ->> SC: Users label using Retool interface
+    SC ->> SC: Reviewers finalize and submit labels
+end
+
+loop update training data with new annotations
+    SC ->> SC: Check for completed annotation tasks
+    SC -->> PDAP: Submit labeled URLs to the app
+    SC ->> HF: Write all annotations to training-urls dataset
+    SC ->> SC: Maintain batch status
+end
+
+loop model training
+    HF ->> HF: Retrain ML models with updated data using trainer in hugging_face
+end
+
+end
+```
+
+## Application Startup
+
+On startup (`lifespan` in `src/api/main.py`), the app initializes in this order:
+
+1. `EnvVarManager` and environment variables are loaded.
+2. Database clients (`DatabaseClient`, `AsyncDatabaseClient`) are created and the database schema is initialized.
+3. An `aiohttp.ClientSession` is created for all outbound HTTP requests.
+4. External service clients are created (PDAP, HuggingFace, Internet Archive, MuckRock).
+5. The `TaskManager` is built with all URL task operators.
+6. The `AsyncCollectorManager` is built and linked to the task manager (so collection triggers processing).
+7. `AsyncCore` is assembled from the above components.
+8. The `AsyncScheduledTaskManager` is built and its jobs are registered.
+9. All three shared objects (`async_core`, `async_scheduled_task_manager`, `logger`) are placed into `app.state` for dependency injection into endpoints.
diff --git a/docs/collectors.md b/docs/collectors.md
new file mode 100644
index 00000000..c0a1626c
--- /dev/null
+++ b/docs/collectors.md
@@ -0,0 +1,145 @@
+# Collector System
+
+Collectors are pluggable modules that gather batches of URLs from external sources. Each collector runs asynchronously and feeds URLs into the pipeline for automated enrichment and human annotation.
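The plug-in idea above boils down to a registry that maps a collector type key to a collector class. A minimal sketch (the enum and mapping names mirror the ones documented in this file, but the bodies are illustrative):

```python
import asyncio
import enum

class CollectorType(enum.Enum):
    EXAMPLE = "EXAMPLE"

class ExampleCollector:
    """Stand-in for a real collector; run() would fetch URLs from a source."""
    async def run(self) -> list[str]:
        return ["https://example.com/police-log"]

# The registry maps each type key to its implementation class.
COLLECTOR_MAPPING = {CollectorType.EXAMPLE: ExampleCollector}

async def start_collector(kind: CollectorType) -> list[str]:
    collector = COLLECTOR_MAPPING[kind]()  # look up and instantiate
    return await collector.run()

urls = asyncio.run(start_collector(CollectorType.EXAMPLE))
print(urls)  # ['https://example.com/police-log']
```

Because dispatch goes through the mapping, adding a new source means adding one enum value and one class — callers never change.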
+ +## Available Collectors + +| Collector | Type Key | Source | +|-----------|----------|--------| +| **Common Crawler** | `COMMON_CRAWLER` | Interfaces with the Common Crawl dataset to extract URLs | +| **Auto-Googler** | `AUTO_GOOGLER` | Automates Google Custom Search API to find URLs | +| **CKAN** | `CKAN` | Scrapes packages from CKAN open data portals | +| **MuckRock Simple Search** | `MUCKROCK_SIMPLE_SEARCH` | Searches MuckRock FOIA requests | +| **MuckRock County Search** | `MUCKROCK_COUNTY_SEARCH` | County-level MuckRock FOIA search | +| **MuckRock All FOIA** | `MUCKROCK_ALL_SEARCH` | Retrieves all MuckRock FOIA requests | +| **Example** | `EXAMPLE` | A reference implementation for development | + +Each collector has its own README in `src/collectors/impl//`. + +## Architecture + +``` +src/collectors/ +├── manager.py # AsyncCollectorManager — starts, stops, tracks collectors +├── mapping.py # COLLECTOR_MAPPING — registry of collector types to classes +├── enums.py # CollectorType enum +├── exceptions.py # Custom exceptions +├── queries/ # Collector-specific database queries +└── impl/ + ├── base.py # AsyncCollectorBase — abstract base class + ├── auto_googler/ + ├── ckan/ + ├── common_crawler/ + ├── example/ + └── muckrock/ +``` + +### AsyncCollectorManager + +The `AsyncCollectorManager` is responsible for: + +- Starting collectors as async tasks. +- Tracking running collectors by batch ID. +- Aborting collectors on request. +- Shutting down all collectors on app shutdown. + +It is initialized during app startup and linked to the task manager, so that when a collector finishes, URL processing tasks are automatically triggered. + +### AsyncCollectorBase + +All collectors inherit from `AsyncCollectorBase`. The base class handles: + +- **Lifecycle management** — timing, status tracking, error handling. +- **Logging** — writes log entries to the database via `AsyncCoreLogger`. 
+- **Post-processing** — after `run_implementation()` completes, the base class runs a preprocessor to normalize the output, inserts URLs into the database, updates the batch record, and triggers downstream tasks. + +The lifecycle is: + +``` +run() + ├── start_timer() + ├── run_implementation() # <-- your code here + ├── stop_timer() + ├── close() # sets status to READY_TO_LABEL + └── process() + ├── preprocessor.preprocess() # normalize output + ├── adb_client.insert_urls() # insert into DB + ├── adb_client.update_batch_post_collection() # update batch + └── post_collection_function_trigger.trigger() # trigger URL tasks +``` + +If the collector is cancelled, status is set to `ABORTED`. If an exception occurs, status is set to `ERROR`. + +## Creating a New Collector + +### 1. Create the Collector Class + +Create a new directory under `src/collectors/impl/` and implement a class that inherits from `AsyncCollectorBase`: + +```python +from src.collectors.impl.base import AsyncCollectorBase +from src.collectors.enums import CollectorType +from src.core.preprocessors.base import PreprocessorBase + +class MyCollector(AsyncCollectorBase): + collector_type = CollectorType.MY_COLLECTOR + preprocessor = MyPreprocessor # must extend PreprocessorBase + + async def run_implementation(self) -> None: + # Your collection logic here. + # Use self.dto to access input parameters. + # Use await self.log("message") to write log entries. + # Set self.data to the collected output (a Pydantic model). + pass +``` + +### 2. Create a Preprocessor + +The preprocessor converts your collector's raw output into a list of `URLInfo` objects. Create a class extending `PreprocessorBase` in `src/core/preprocessors/`: + +```python +from src.core.preprocessors.base import PreprocessorBase +from src.db.models.impl.url.core.pydantic.info import URLInfo + +class MyPreprocessor(PreprocessorBase): + def preprocess(self, data) -> list[URLInfo]: + # Convert raw data into URLInfo objects + pass +``` + +### 3. 
Add to the Collector Mapping + +Register your collector in `src/collectors/mapping.py`: + +```python +from src.collectors.impl.my_collector.collector import MyCollector + +COLLECTOR_MAPPING = { + # ... existing collectors ... + CollectorType.MY_COLLECTOR: MyCollector, +} +``` + +### 4. Add the Enum Value + +Add your collector type to `CollectorType` in `src/collectors/enums.py`. + +### 5. Create an API Endpoint + +Add a POST endpoint in `src/api/endpoints/collector/routes.py` to trigger your collector. Follow the pattern of existing collector endpoints. + +### 6. Create a Request DTO + +Define a Pydantic model for the collector's input parameters. This is passed as `self.dto` in the collector. + +## Batch Lifecycle + +When a collector is started via the API: + +1. A new batch record is created in the database with status `IN_PROCESS`. +2. The collector runs asynchronously. +3. On success, the batch status becomes `READY_TO_LABEL` and URL tasks are triggered. +4. On error, the batch status becomes `ERROR`. +5. On abort, the batch status becomes `ABORTED`. + +Batches can be viewed, filtered, and managed through the `/batch` API endpoints. diff --git a/docs/deployment.md b/docs/deployment.md new file mode 100644 index 00000000..eddafb96 --- /dev/null +++ b/docs/deployment.md @@ -0,0 +1,126 @@ +# Deployment & Migrations + +## Production Container + +The application is containerized using Docker and deployed to DigitalOcean. + +### Dockerfile + +The production image (`Dockerfile` in the repo root) is built from `python:3.11.9-slim` and includes: + +- **uv** for dependency management (production deps only, no dev dependencies). +- **Playwright** with Chromium for URL screenshot capture. +- **spaCy** with the `en_core_web_sm` model for NLP-based location identification. +- Application source code, Alembic migrations, and the startup script. + +The container exposes port **80**. + +### Startup Process + +The entrypoint runs `execute.sh`, which does two things: + +1. 
**Applies database migrations** — runs `python apply_migrations.py`, which uses Alembic to bring the database to the latest schema version. +2. **Starts the server** — runs `uvicorn src.api.main:app --host 0.0.0.0 --port 80`. + +### Environment Variables + +The production container does **not** include a `.env` file (for security). Environment variables must be provided by the hosting platform. See [ENV.md](../ENV.md) for the full list. + +## Database Migrations with Alembic + +### Overview + +[Alembic](https://alembic.sqlalchemy.org/) manages the database schema. Migration scripts live in `alembic/versions/` and are applied in order. + +Key files: + +| File | Purpose | +|------|---------| +| `alembic.ini` | Alembic configuration (in repo root) | +| `alembic/env.py` | Migration environment setup | +| `alembic/script.py.mako` | Template for new migration scripts | +| `alembic/versions/` | Migration scripts (40+) | +| `apply_migrations.py` | Script to apply migrations using env vars for the connection string | + +### Creating a Migration + +```bash +alembic revision --autogenerate -m "Description for migration" +``` + +This generates a new file in `alembic/versions/` based on differences between the current models and the database schema. **Always review the generated `upgrade()` and `downgrade()` functions** before committing. + +### Applying Migrations + +Migrations are applied automatically on every deployment (via `execute.sh`). To apply manually: + +```bash +python apply_migrations.py +``` + +The script reads `POSTGRES_USER`, `POSTGRES_PASSWORD`, `POSTGRES_HOST`, `POSTGRES_PORT`, and `POSTGRES_DB` from the environment, constructs a connection string, and runs `alembic upgrade head`. + +### Migration Testing + +The `tests/alembic/test_revisions.py` test validates that migration scripts are well-formed. This runs in CI on every pull request. 
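The connection-string assembly described above can be sketched as follows. The variable names are the documented ones; the function itself is illustrative, not the script's actual code:

```python
def build_db_url(env: dict[str, str]) -> str:
    """Assemble a PostgreSQL URL from the documented environment variables."""
    return (
        f"postgresql://{env['POSTGRES_USER']}:{env['POSTGRES_PASSWORD']}"
        f"@{env['POSTGRES_HOST']}:{env['POSTGRES_PORT']}/{env['POSTGRES_DB']}"
    )

url = build_db_url({
    "POSTGRES_USER": "app",
    "POSTGRES_PASSWORD": "secret",
    "POSTGRES_HOST": "127.0.0.1",
    "POSTGRES_PORT": "5432",
    "POSTGRES_DB": "source_collector_test_db",
})
print(url)  # postgresql://app:secret@127.0.0.1:5432/source_collector_test_db
# The real script then hands this URL to Alembic and runs "upgrade head".
```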
+ +## Data Sources App Synchronization + +The Source Manager synchronizes data to the Data Sources App hourly via nine scheduled tasks. These cover three entities (agencies, data sources, meta URLs) across three operations (add, update, delete). + +### How Sync Works + +Each sync task: + +1. Queries the SM database for entities needing sync (new entries, updated entries, or entries flagged for deletion). +2. Sends a request to the DS App endpoint at `/v3/sync/{entity}/{action}`. +3. Updates local DS App Link tables to record the sync. + +### Sync Preconditions + +**Add:** +- Agencies must be linked to a location. +- Data sources must be validated and linked to an agency. +- Meta URLs must be validated and linked to an agency. + +**Update:** +- Triggered when relevant tables are modified (entity row, link tables, metadata). + +**Delete:** +- When an entity is deleted in SM, a deletion flag is added to its DS App Link entry. +- The sync task reads these flags, sends the delete to DS, and removes the link entry. + +### Sync Flags + +Each sync task can be individually disabled: + +```dotenv +DS_APP_SYNC_AGENCY_ADD_TASK_FLAG=0 +DS_APP_SYNC_AGENCY_UPDATE_TASK_FLAG=0 +DS_APP_SYNC_AGENCY_DELETE_TASK_FLAG=0 +DS_APP_SYNC_DATA_SOURCE_ADD_TASK_FLAG=0 +DS_APP_SYNC_DATA_SOURCE_UPDATE_TASK_FLAG=0 +DS_APP_SYNC_DATA_SOURCE_DELETE_TASK_FLAG=0 +DS_APP_SYNC_META_URL_ADD_TASK_FLAG=0 +DS_APP_SYNC_META_URL_UPDATE_TASK_FLAG=0 +DS_APP_SYNC_META_URL_DELETE_TASK_FLAG=0 +``` + +## CI/CD Pipelines + +### Test Pipeline (`.github/workflows/test_app.yml`) + +Runs on every pull request: +- Spins up a PostgreSQL 15 service container. +- Installs dependencies via uv. +- Runs `pytest tests/automated` and `pytest tests/alembic`. +- 20-minute timeout. + +### Lint Pipeline (`.github/workflows/python_checks.yml`) + +Runs on every pull request: +- Runs flake8 via reviewdog. +- Posts advisory warnings as PR comments. +- Does **not** block merges. 
+ +Note: `python_checks.yml` only works on pull requests from branches within the repo, not from forks. diff --git a/docs/development.md b/docs/development.md new file mode 100644 index 00000000..7ea62cc6 --- /dev/null +++ b/docs/development.md @@ -0,0 +1,182 @@ +# Local Development Guide + +## Prerequisites + +- **Python 3.11+** +- **[uv](https://docs.astral.sh/uv/)** — Python package manager +- **[Docker](https://docs.docker.com/get-started/get-docker/)** — required for the local PostgreSQL database + +## Quick Start + +```bash +# 1. Install dependencies +uv sync + +# 2. Start the local database +cd local_database +docker compose up -d +cd .. + +# 3. Create your .env file (see Environment Variables below) + +# 4. Run the app +fastapi dev main.py +``` + +Then open `http://localhost:8000/api` for the interactive API docs. + +## Environment Variables + +Create a `.env` file in the repository root. See [ENV.md](../ENV.md) for the full reference. + +### Minimum for Local Development + +At minimum, you need the database connection variables: + +```dotenv +POSTGRES_USER=test_source_collector_user +POSTGRES_PASSWORD=HanviliciousHamiltonHilltops +POSTGRES_DB=source_collector_test_db +POSTGRES_HOST=127.0.0.1 +POSTGRES_PORT=5432 +DEV=true +``` + +These match the defaults in `local_database/docker-compose.yml`. 
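As a quick sanity check (illustrative, not part of the repo), you can verify that the minimum variables above are present before starting the app:

```python
import os

# The five variables listed above as the minimum for local development:
REQUIRED = [
    "POSTGRES_USER", "POSTGRES_PASSWORD", "POSTGRES_DB",
    "POSTGRES_HOST", "POSTGRES_PORT",
]

def missing_vars(environ=os.environ) -> list[str]:
    """Names from REQUIRED that are unset or empty in the given environment."""
    return [name for name in REQUIRED if not environ.get(name)]

# With a fully populated environment, nothing is reported missing:
print(missing_vars({name: "x" for name in REQUIRED}))  # []
```

A check like this fails fast with a clear message instead of surfacing as a connection error deep inside SQLAlchemy.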
+ +### API Keys + +You'll need additional keys depending on which features you're working on: + +| Variable | Required For | +|----------|-------------| +| `DS_APP_SECRET_KEY` | Any authenticated endpoint | +| `GOOGLE_API_KEY`, `GOOGLE_CSE_ID` | Auto-Googler collector | +| `DEEPSEEK_API_KEY` or `OPENAI_API_KEY` | LLM-powered tasks | +| `HUGGINGFACE_INFERENCE_API_KEY` | ML classification tasks | +| `HUGGINGFACE_HUB_TOKEN` | Uploading to HuggingFace | +| `PDAP_EMAIL`, `PDAP_PASSWORD`, `PDAP_API_KEY`, `PDAP_API_URL` | Syncing to the Data Sources App | +| `DISCORD_WEBHOOK_URL` | Error notifications | +| `INTERNET_ARCHIVE_S3_KEYS` | Internet Archive integration | + +### Feature Flags + +All features are enabled by default. To disable a feature during development, set its flag to `0`: + +```dotenv +SCHEDULED_TASKS_FLAG=0 # Disable all scheduled tasks +POST_TO_DISCORD_FLAG=0 # Disable Discord notifications +``` + +See [ENV.md](../ENV.md) for the full list of flags. + +## Database Setup + +### Option 1: Clean Local Database + +This gives you an empty database — good for running tests and isolated development. + +```bash +cd local_database +docker compose up -d +``` + +The database schema is automatically created on app startup via Alembic migrations. + +To stop the database: + +```bash +cd local_database +docker compose down +``` + +### Option 2: Mirrored Production Database + +This gives you a local copy of production data — useful for debugging or working with realistic data. + +```bash +python start_mirrored_local_app.py +``` + +This script: +1. Starts the local database container. +2. Runs the DataDumper to pull a snapshot from production (cached for 24 hours). +3. Restores the snapshot into your local database. +4. Applies any pending Alembic migrations. +5. Starts the FastAPI server. + +The mirrored approach requires additional environment variables for the production database connection. See the `Data Dumper` section in [ENV.md](../ENV.md). 
+ +## Database Migrations + +This project uses [Alembic](https://alembic.sqlalchemy.org/) for database migrations. + +### Creating a New Migration + +```bash +alembic revision --autogenerate -m "Description for migration" +``` + +Then review the generated file in `alembic/versions/` and adjust the `upgrade()` and `downgrade()` functions as needed. + +### Applying Migrations + +Migrations are applied automatically on app startup. To apply manually: + +```bash +python apply_migrations.py +``` + +Or using alembic directly: + +```bash +alembic upgrade head +``` + +See `alembic/README.md` for more details. + +## Project Structure + +``` +. +├── src/ # Application source code +│ ├── api/ # FastAPI routers and endpoints +│ ├── core/ # Integration layer and task system +│ ├── db/ # Database models, client, queries +│ ├── collectors/ # URL collection strategies +│ ├── external/ # External service clients +│ ├── security/ # Authentication and authorization +│ └── util/ # Shared utilities +├── tests/ # Test suite +├── alembic/ # Database migrations +├── local_database/ # Docker setup for local PostgreSQL +├── docs/ # Documentation (you are here) +├── main.py # Alternative entry point +├── docker-compose.yml # Test environment (app + database) +├── Dockerfile # Production container +└── ENV.md # Full environment variable reference +``` + +## Common Workflows + +### Adding a New API Endpoint + +1. Create a directory under `src/api/endpoints//`. +2. Follow the existing pattern: `routes.py` for the router, subdirectories for each HTTP method. +3. Include the router in `src/api/main.py`. + +### Adding a New Collector + +See [collectors.md](collectors.md) for the full guide. + +### Adding a New Scheduled Task + +1. Create a new task operator in `src/core/tasks/scheduled/impl/`. +2. Register it in the scheduled task loader (`src/core/tasks/scheduled/loader.py`). +3. Add a corresponding flag in `EnvVarManager` and document it in `ENV.md`. + +### Adding a New URL Task Operator + +1. 
Create a new operator in `src/core/tasks/url/operators/`. +2. Register it in the URL task loader (`src/core/tasks/url/loader.py`). +3. Add a corresponding flag in `EnvVarManager` and document it in `ENV.md`. diff --git a/docs/testing.md b/docs/testing.md new file mode 100644 index 00000000..05176797 --- /dev/null +++ b/docs/testing.md @@ -0,0 +1,120 @@ +# Testing Guide + +## Prerequisites + +- **Docker** must be installed and the Docker engine must be running. +- **uv** for dependency management. + +## Test Categories + +Tests are organized into three categories: + +### Automated Tests (`tests/automated/`) + +These run in CI and do not call any third-party APIs. They include: + +- **Integration tests** — API endpoints, database operations, core functionality, security, and tasks. +- **Unit tests** — isolated logic tests. + +### Alembic Tests (`tests/alembic/`) + +Validates that database migration scripts are well-formed and can be applied cleanly. + +### Manual Tests (`tests/manual/`) + +Tests that call third-party APIs (Google, MuckRock, etc.) and are **not** run automatically. The directory intentionally lacks the `test` prefix to prevent accidental inclusion in pytest runs. + +Run these individually and only when needed — they may incur API costs. + +## Running Tests Locally + +### Option 1: Direct (Recommended for Development) + +Start the local database, then run pytest: + +```bash +# Start the database +cd local_database +docker compose up -d +cd .. + +# Run automated tests +uv run pytest tests/automated + +# Run alembic tests +uv run pytest tests/alembic +``` + +### Option 2: Docker Compose (Matches CI) + +This spins up a two-container setup (FastAPI app + PostgreSQL): + +```bash +docker compose up -d +``` + +Then run tests inside the container: + +```bash +docker exec data-source-identification-app-1 pytest /app/tests/automated +``` + +**Note:** The `docker-compose.yml` in the root is configured for Linux Docker (used in GitHub Actions). 
For local development on Windows/macOS, you may need to change `POSTGRES_HOST` from `172.17.0.1` to `host.docker.internal`. See the comments in `docker-compose.yml`. + +## CI Pipeline + +The GitHub Actions workflow (`.github/workflows/test_app.yml`) runs on every pull request: + +1. Starts a PostgreSQL 15 service container with health checks. +2. Installs uv and project dependencies. +3. Runs `pytest tests/automated`. +4. Runs `pytest tests/alembic`. + +The pipeline has a 20-minute timeout. + +A separate workflow (`.github/workflows/python_checks.yml`) runs flake8 linting via reviewdog on pull requests. These are advisory warnings and do not block merges. + +## Test Structure + +``` +tests/ +├── conftest.py # Session fixtures: DB setup, teardown, client instances +├── helpers/ +│ ├── alembic_runner.py # Alembic test utilities +│ ├── data_creator/ # Test data generation helpers +│ └── setup/ # Database populate/wipe utilities +├── test_data/ # Static test data (JSON files, etc.) +├── automated/ +│ ├── integration/ +│ │ ├── api/ # Endpoint tests (agencies, annotate, batch, etc.) +│ │ ├── core/ # Core async operation tests +│ │ ├── db/ +│ │ │ ├── client/ # Database client method tests +│ │ │ └── structure/ # Schema validation tests +│ │ ├── readonly/ # Read-only operation tests +│ │ ├── security_manager/ # Auth/authz tests +│ │ └── tasks/ # Task implementation tests +│ └── unit/ # Unit tests +├── alembic/ +│ └── test_revisions.py # Migration validation +└── manual/ + ├── core/lifecycle/ # Core lifecycle tests + ├── source_collectors/ # Collector integration tests + └── unsorted/ # Miscellaneous manual tests +``` + +## Test Configuration + +From `pytest.ini`: + +- **Timeout:** 300 seconds per test. +- **Async mode:** `auto` (all async tests are automatically detected). +- **Fixture loop scope:** `function` (each test gets its own event loop). +- **Manual tests** are marked and excluded from automated runs. 
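Under the configuration above (async mode `auto`, function-scoped loops), an async test function is collected without any explicit marker. A minimal illustration, driven manually here so the snippet is self-contained:

```python
import asyncio

async def fetch_status() -> int:
    await asyncio.sleep(0)  # stand-in for an await on a DB client or endpoint
    return 200

# In the suite this would simply be:
#     async def test_fetch_status():
#         assert await fetch_status() == 200
# and pytest-asyncio's "auto" mode would run it on a fresh event loop.
result = asyncio.run(fetch_status())
print(result)  # 200
```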
+ +## Writing New Tests + +- Place automated tests in `tests/automated/integration/` or `tests/automated/unit/`. +- Use the fixtures defined in `conftest.py` for database access (`adb_client`, `db_client`). +- Use helpers in `tests/helpers/data_creator/` to generate test data. +- If your test calls a third-party API, place it in `tests/manual/` and do **not** prefix the directory with `test`.