46 commits
cc72350
feat(collector): add Internet Archive crawler collector
labradorite-dev Feb 12, 2026
54555aa
test(collector): add integration and manual tests for Internet Archiv…
labradorite-dev Feb 16, 2026
5cf239f
fix(collector): use structured logging instead of print in Internet A…
labradorite-dev Feb 16, 2026
1ec5469
refactor(collector): remove redundant static asset exclude patterns
labradorite-dev Feb 16, 2026
a619038
style(collector): add docstrings and type annotations to pass flake8
labradorite-dev Feb 16, 2026
97c5e3c
fix(collector): resolve CI failures after rebasing onto dev
labradorite-dev Feb 18, 2026
657fd08
Merge pull request #589 from Police-Data-Accessibility-Project/379-in…
maxachis Feb 19, 2026
a450964
fix(ia-save): verify async save status before marking success
maxachis Feb 27, 2026
180365b
fix(ia-save): authorize status polling requests
maxachis Feb 27, 2026
b0ab4d4
Remove agency_described_not_in_database from data sources metadata
maxachis Feb 27, 2026
25be7a8
Remove reviewdog and document lint directives
maxachis Feb 27, 2026
d410574
Enforce national-only locations for federal agencies
maxachis Feb 27, 2026
ef4fd2f
Add prek-based pre-commit lint hook
maxachis Feb 27, 2026
9200445
Add auth guards to sensitive mutating endpoints
maxachis Feb 27, 2026
baf7550
feat(benchmark): add annotation load time baseline benchmarks (#566)
labradorite-dev Feb 25, 2026
6b60ec7
feat(benchmark): add 10k-URL scale seeder and scale benchmark tests (…
labradorite-dev Feb 25, 2026
1cea1da
fix(benchmark): replace per-URL suggestion loops with bulk inserts (#…
labradorite-dev Feb 25, 2026
e9db7d3
fix(benchmark): use confidence=1.0 for location suggestions (#566)
labradorite-dev Feb 25, 2026
51596ca
feat(db): materialize url_annotation_count_view and url_annotation_fl…
labradorite-dev Feb 26, 2026
2c414e5
chore(deps): upgrade pytest-benchmark from 4.0 to 5.2.3
labradorite-dev Feb 26, 2026
825eabb
fix(annotate): use LEFT JOINs on materialized views; refresh in sort …
labradorite-dev Feb 27, 2026
90118a0
style: fix lint issues in our changes (#566)
labradorite-dev Feb 27, 2026
b9a0c27
fix(migration): rebase onto upstream 1fb2286a016c to resolve multiple…
labradorite-dev Feb 27, 2026
f826efd
ci(benchmark): post results to GHA job summary (#566)
labradorite-dev Feb 27, 2026
d0861d0
ci(benchmark): capture per-phase timings in GHA job summary (#566)
labradorite-dev Feb 27, 2026
864a8bd
fix(benchmark): use removesuffix to avoid mangling phase labels (#566)
labradorite-dev Feb 27, 2026
72a1810
docs(benchmark): simplify README; remove stale diagram and hardcoded …
labradorite-dev Feb 27, 2026
ed5a7e5
style: fix flake8 warnings in benchmark test and timing module (#566)
labradorite-dev Feb 27, 2026
9aae5a4
style: add missing docstrings to benchmark package, conftest, and sca…
labradorite-dev Feb 27, 2026
00f24ec
style: fix flake8 annotations from PR check (#566)
labradorite-dev Feb 27, 2026
35248e0
refactor(benchmark): extract summary heredoc to scripts/post_benchmar…
labradorite-dev Feb 28, 2026
80b77b8
docs(development): replace hardcoded dev db password with reference t…
labradorite-dev Feb 28, 2026
7cb1767
fix(review): address PR feedback from maxachis
labradorite-dev Mar 8, 2026
4f37aed
refactor(benchmark): replace _phase timing with pyinstrument profiler…
labradorite-dev Mar 8, 2026
3e710ea
fix(tests): refresh materialized views in annotation setup helpers (#…
labradorite-dev Mar 9, 2026
5779874
Merge pull request #590 from Police-Data-Accessibility-Project/mc_387…
maxachis Mar 9, 2026
6a621fc
Merge pull request #591 from Police-Data-Accessibility-Project/fix-51…
maxachis Mar 9, 2026
6fba494
Merge pull request #593 from Police-Data-Accessibility-Project/fix-59…
maxachis Mar 9, 2026
c681400
Merge pull request #594 from Police-Data-Accessibility-Project/fix-53…
maxachis Mar 9, 2026
625e18d
Resolve merge conflict in python_checks.yml
maxachis Mar 9, 2026
c834e5d
Merge pull request #595 from Police-Data-Accessibility-Project/fix-58…
maxachis Mar 9, 2026
b52c99e
Merge pull request #596 from Police-Data-Accessibility-Project/fix-57…
maxachis Mar 9, 2026
d060b20
Merge remote-tracking branch 'origin/dev' into issue-566-optimize-ann…
maxachis Mar 9, 2026
c3c360e
Add PR to main trigger for benchmark workflow
maxachis Mar 9, 2026
9e5c124
Add Alembic merge migration to resolve multiple heads
maxachis Mar 9, 2026
ee09ea1
Merge pull request #601 from Police-Data-Accessibility-Project/issue-…
maxachis Mar 9, 2026
3 changes: 3 additions & 0 deletions .flake8
@@ -0,0 +1,3 @@
[flake8]
ignore = E501,W291,W293,D401,D400,E402,E302,D200,D202,D205,W503,E203,D204,D403
extend-exclude = .venv,.uv-cache
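A quick way to sanity-check this config is to parse it and compare the ignored codes against the list documented elsewhere in this PR. The sketch below uses the standard-library `configparser`; the function name `read_flake8_ignores` is hypothetical, and flake8's own option parsing is more involved than this.

```python
import configparser


def read_flake8_ignores(config_text: str) -> set[str]:
    """Return the codes listed under `ignore` in a [flake8] section."""
    parser = configparser.ConfigParser()
    parser.read_string(config_text)
    raw = parser.get("flake8", "ignore", fallback="")
    # Codes are comma-separated and may be wrapped across lines.
    return {code.strip() for code in raw.replace("\n", ",").split(",") if code.strip()}


# Contents of the .flake8 file added in this diff.
CONFIG = """\
[flake8]
ignore = E501,W291,W293,D401,D400,E402,E302,D200,D202,D205,W503,E203,D204,D403
extend-exclude = .venv,.uv-cache
"""

ignored = read_flake8_ignores(CONFIG)
print(sorted(ignored))
```

Note that `D103` (missing docstring in public function) is not in the ignore set, so flake8 still reports it.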
66 changes: 66 additions & 0 deletions .github/workflows/benchmark.yml
@@ -0,0 +1,66 @@
name: Annotation Benchmark

on:
  workflow_dispatch:
  pull_request:
    branches:
      - main

jobs:
  benchmark:
    runs-on: ubuntu-latest
    timeout-minutes: 30
    container: python:3.11.9

    services:
      postgres:
        image: postgres:15
        env:
          POSTGRES_PASSWORD: postgres
        options: >-
          --health-cmd pg_isready
          --health-interval 10s
          --health-timeout 5s
          --health-retries 5

    env:
      POSTGRES_PASSWORD: postgres
      POSTGRES_USER: postgres
      POSTGRES_DB: postgres
      POSTGRES_HOST: postgres
      POSTGRES_PORT: 5432
      GOOGLE_API_KEY: TEST
      GOOGLE_CSE_ID: TEST
      PROFILE_DIR: profiles

    steps:
      - name: Checkout repository
        uses: actions/checkout@v4

      - name: Install uv and set the python version
        uses: astral-sh/setup-uv@v5

      - name: Install the project
        run: uv sync --locked --group dev

      - name: Create profiles directory
        run: mkdir -p profiles

      - name: Run benchmark tests
        run: |
          uv run pytest tests/automated/integration/benchmark \
            -m "manual and benchmark" \
            --benchmark-json=benchmark-results.json \
            -v

      - name: Post benchmark summary
        run: uv run python scripts/post_benchmark_summary.py

      - name: Upload benchmark results
        uses: actions/upload-artifact@v4
        with:
          name: benchmark-results-${{ github.sha }}
          path: |
            benchmark-results.json
            profiles/
          retention-days: 90
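The contents of `scripts/post_benchmark_summary.py` are not shown in this part of the diff. As a rough illustration of what a "post benchmark summary" step could do, the sketch below renders the `benchmarks` list from a pytest-benchmark JSON report as a Markdown table and appends it to the file named by `GITHUB_STEP_SUMMARY`. The function names, the table layout, and the sample data are all assumptions, not the PR's actual script.

```python
import json
import os
from pathlib import Path


def build_summary(results: dict) -> str:
    """Render pytest-benchmark results as a Markdown table."""
    lines = [
        "## Annotation Benchmark",
        "",
        "| Benchmark | Mean (s) | Min (s) | Max (s) | Rounds |",
        "| --- | ---: | ---: | ---: | ---: |",
    ]
    for bench in results.get("benchmarks", []):
        stats = bench["stats"]
        lines.append(
            f"| {bench['name']} | {stats['mean']:.4f} | {stats['min']:.4f} "
            f"| {stats['max']:.4f} | {stats['rounds']} |"
        )
    return "\n".join(lines) + "\n"


def post_summary(results_path: Path) -> str:
    """Append the rendered table to the GitHub Actions job summary, if any."""
    summary = build_summary(json.loads(results_path.read_text()))
    # GitHub Actions sets GITHUB_STEP_SUMMARY to a file path; outside CI the
    # variable is absent and the summary is only returned.
    summary_file = os.environ.get("GITHUB_STEP_SUMMARY")
    if summary_file:
        with open(summary_file, "a", encoding="utf-8") as handle:
            handle.write(summary)
    return summary


# Made-up sample in the shape pytest-benchmark writes with --benchmark-json.
SAMPLE = {
    "benchmarks": [
        {
            "name": "test_annotation_load",  # hypothetical test name
            "stats": {"mean": 0.25, "min": 0.2, "max": 0.3, "rounds": 5},
        }
    ]
}

print(build_summary(SAMPLE))
```

In the workflow above, such a script would be pointed at `benchmark-results.json`, the file produced by the `--benchmark-json` flag in the preceding step.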
2 changes: 1 addition & 1 deletion .github/workflows/python_checks.yml
@@ -18,5 +18,5 @@ jobs:
         uses: reviewdog/action-flake8@v3
         with:
           github_token: ${{ secrets.GITHUB_TOKEN }}
-          flake8_args: --ignore E501,W291,W293,D401,D400,E402,E302,D200,D202,D205,W503,E203,D204,D403
+          flake8_args: .
           level: warning
2 changes: 2 additions & 0 deletions .gitignore
@@ -10,3 +10,5 @@ runs/
model.safetensors
training_args.bin
/temp/
.devcontainer/.env.secrets
.devcontainer/
11 changes: 11 additions & 0 deletions .pre-commit-config.yaml
@@ -0,0 +1,11 @@
repos:
  - repo: https://github.com/pycqa/flake8
    rev: 7.3.0
    hooks:
      - id: flake8
        additional_dependencies:
          - flake8-docstrings
          - flake8-simplify
          - flake8-unused-arguments
          - flake8-annotations

4 changes: 4 additions & 0 deletions CLAUDE.md
@@ -113,6 +113,10 @@ See `ENV.md` for the full reference. Key categories:

## Important Notes

- Follow the repository's linting style guide when writing or editing Python:
  - Run `flake8` with this exact ignore set in mind: `E501,W291,W293,D401,D400,E402,E302,D200,D202,D205,W503,E203,D204,D403`
  - Keep import ordering and blank-line usage compliant with `flake8` defaults aside from the ignore list above
  - Prefer concise docstrings where present and avoid introducing new docstring violations outside ignored codes
- The app exposes its API docs at `/api` (not the default `/docs` — `/docs` redirects to `/api`).
- CORS is configured for `localhost:8888`, `pdap.io`, and `pdap.dev`.
- Two permission levels: `access_source_collector` (general) and `source_collector_final_review` (final review).
12 changes: 11 additions & 1 deletion README.md
@@ -59,6 +59,16 @@ Thank you for your interest in contributing to this project! Please follow these

## Code Quality

-Docstrings and type hints are checked via a GitHub Action (`python_checks.yml`) using [pydocstyle](https://www.pydocstyle.org/en/stable/) and [mypy](https://mypy-lang.org/). These produce advisory PR comments and do *not* block merges.
+Linting runs via flake8 in GitHub Actions (`python_checks.yml`) and posts advisory review comments (non-blocking).
+
+To run the same lint checks before committing, install and use the `prek` pre-commit hook runner:
+
+```bash
+uv sync --group dev
+uv run prek install
+
+# run hooks on all files
+uv run prek run --all-files
+```

Note: `python_checks.yml` only runs on pull requests from within the repo, not from forks.
60 changes: 60 additions & 0 deletions alembic/versions/2026_02_15_1257-1fb2286a016c_add_internet_archive_to_batch_strategy_.py
@@ -0,0 +1,60 @@
"""add internet_archive to batch_strategy enum

Revision ID: 1fb2286a016c
Revises: 759ce7d0772b
Create Date: 2026-02-15 12:57:34.550327

"""
from typing import Sequence, Union

from alembic import op

from src.util.alembic_helpers import switch_enum_type

# revision identifiers, used by Alembic.
revision: str = '1fb2286a016c'
down_revision: Union[str, None] = '759ce7d0772b'
branch_labels: Union[str, Sequence[str], None] = None
depends_on: Union[str, Sequence[str], None] = None


Check warning on line 21 in alembic/versions/2026_02_15_1257-1fb2286a016c_add_internet_archive_to_batch_strategy_.py (GitHub Actions / flake8): D103 Missing docstring in public function

def upgrade() -> None:
    switch_enum_type(
        table_name="batches",
        column_name="strategy",
        enum_name="batch_strategy",
        new_enum_values=[
            "example",
            "ckan",
            "muckrock_county_search",
            "auto_googler",
            "muckrock_all_search",
            "muckrock_simple_search",
            "common_crawler",
            "manual",
            "internet_archive",
        ],
    )


Check warning on line 40 in alembic/versions/2026_02_15_1257-1fb2286a016c_add_internet_archive_to_batch_strategy_.py (GitHub Actions / flake8): D103 Missing docstring in public function

def downgrade() -> None:
    op.execute("""
    DELETE FROM BATCHES
    WHERE STRATEGY = 'internet_archive'
    """)

    switch_enum_type(
        table_name="batches",
        column_name="strategy",
        enum_name="batch_strategy",
        new_enum_values=[
            "example",
            "ckan",
            "muckrock_county_search",
            "auto_googler",
            "muckrock_all_search",
            "muckrock_simple_search",
            "common_crawler",
            "manual",
        ],
    )
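The body of `switch_enum_type` in `src.util.alembic_helpers` is not part of this diff. For readers unfamiliar with the pattern, the sketch below shows one common way to swap a PostgreSQL enum's value set: rename the old type, create the new one, re-cast the column, and drop the old type. The function `switch_enum_statements` is hypothetical and only builds the SQL strings; it is an assumption about the general technique, not the project's helper.

```python
def switch_enum_statements(
    table_name: str,
    column_name: str,
    enum_name: str,
    new_enum_values: list[str],
) -> list[str]:
    """Build SQL that replaces a PostgreSQL enum type with a new value set."""
    old_name = f"{enum_name}_old"
    # Quote each value as a SQL string literal, doubling embedded quotes.
    values = ", ".join("'" + v.replace("'", "''") + "'" for v in new_enum_values)
    return [
        f"ALTER TYPE {enum_name} RENAME TO {old_name}",
        f"CREATE TYPE {enum_name} AS ENUM ({values})",
        # The cast goes through text; it fails for any row whose current
        # value is missing from the new enum.
        f"ALTER TABLE {table_name} ALTER COLUMN {column_name} "
        f"TYPE {enum_name} USING {column_name}::text::{enum_name}",
        f"DROP TYPE {old_name}",
    ]


statements = switch_enum_statements(
    table_name="batches",
    column_name="strategy",
    enum_name="batch_strategy",
    new_enum_values=["manual", "internet_archive"],
)
for statement in statements:
    print(statement + ";")
```

Under this pattern, the re-cast step explains why `downgrade()` deletes `internet_archive` batches before switching the enum: a row still holding that value could not be cast to the narrower type.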