feat(db): improve `GET /annotate/all` performance by labradorite-dev · Pull Request #601 · Police-Data-Accessibility-Project/data-source-manager

labradorite-dev · 2026-02-28T22:19:04Z

Summary

Improves GET /annotate/all performance.

Background Context

Issue #566 describes improving performance of an endpoint. However, before improving performance I first wanted to measure current performance to drive the decision with data. As such I established a benchmark using pytest-benchmark. To back the benchmark with scale, it seeds 10k URLs to populate some relatively realistic annotation data.

The benchmark determined that main_query_s was by far the largest portion of wall-clock time in this endpoint and so I focused my efforts on improving that performance. I considered a number of different solutions but by far the most effective is to convert the url_annotation_count_view and url_annotation_flags from regular PostgreSQL views to materialized views with unique indexes on url_id.

Performance Improvement

Measured locally with pyinstrument via the new *_profiled benchmark tests:

Test	Before (dev)	After	Improvement
Small dataset (`profiled`)	12.21 ms	5.99 ms	-51%
10k URLs (`scale_profiled`)	41.72 ms	3.53 ms	-92%

Staleness Tradeoff

I added the materialized views to the existing RefreshMaterializedViewsOperator schedule which is once per day. In the worst case, URL ranking could be up to 24 hours stale - e.g. a URL that just received its 10th annotation may still be ranked highly when it should be deprioritized. My personal viewpoint is that this is acceptable because:

No user-facing data is incorrect - only the ordering of work is affected
No writes or authorization decisions are gated on these values
Annotators are directed to URLs in a slightly suboptimal order, not a wrong one
Follows the same trade-off already accepted by the 3 existing materialized views in the codebase

If fresher data is a product priority, I would recommend a more aggressive refresh rate (such as hourly) for the materialized view.

Alternative Long Term Solution

I would be remiss to not include the other, much larger refactor I considered which is to implement proper CQRS. The core of the issue here is that data is prioritized for writes, NOT reads. Then, at query time we pay the price and have to stitch all the data together. Materialized views shift that data stitching slightly left (once a day in a separate job), but a potentially more "complete" fix would be to add a new, fully denormalized table which looks like this:

-- read model, updated on every write command
annotation_queue (
    url_id        PRIMARY KEY,
    priority_rank INT,  -- pre-computed: manual > followed > count > id
    total_count   INT,
    is_manual     BOOL,
    followed_by_user BOOL,
    ...
)

Then at each time we write a new annotation to be consumed by annotators, we also populate this table correctly. Then, at query time we have a primary key lookup (extremely cheap): SELECT * FROM annotation_queue ORDER BY priority_rank LIMIT 1.

If we are interested, I do think this could be a new issue if read performance will continue to be prioritized for the app.

Test plan

All 4 benchmark tests pass (test_benchmark_annotate_all_*)
Migration applies cleanly from scratch (alembic upgrade head)
CONCURRENTLY refresh works (unique indexes in place as prerequisite)

gitguardian · 2026-02-28T22:19:09Z

⚠️ GitGuardian has uncovered 1 secret following the scan of your pull request.

Please consider investigating the findings and remediating the incidents. Failure to do so may lead to compromising the associated services or software components.

🔎 Detected hardcoded secret in your pull request

GitGuardian id	GitGuardian status	Secret	Commit	Filename
15086721	Triggered	Generic Password	`baf7550`	.github/workflows/benchmark.yml	View secret

🛠 Guidelines to remediate hardcoded secrets

Understand the implications of revoking this secret by investigating where it is used in your code.
Replace and store your secret safely. Learn here the best practices.
Revoke and rotate this secret.
If possible, rewrite git history. Rewriting git history is not a trivial act. You might completely break other contributing developers' workflow and you risk accidentally deleting legitimate data.

To avoid such incidents in the future consider

following these best practices for managing and storing secrets including API keys and other credentials
install secret detection on pre-commit to catch secret before it leaves your machine and ease remediation.

^{🦉 GitGuardian detects secrets in your source code to help developers and security teams secure the modern development process. You are seeing this because you or someone else with access to this repository has authorized GitGuardian to scan your pull request.}

labradorite-dev · 2026-02-28T22:22:49Z

⚠️ GitGuardian has uncovered 1 secret following the scan of your pull request.

Please consider investigating the findings and remediating the incidents. Failure to do so may lead to compromising the associated services or software components.

🔎 Detected hardcoded secret in your pull request

GitGuardian id GitGuardian status Secret Commit Filename
15086721 Triggered Generic Password f099307 .github/workflows/benchmark.yml View secret
🛠 Guidelines to remediate hardcoded secrets

Understand the implications of revoking this secret by investigating where it is used in your code.

Replace and store your secret safely. Learn here the best practices.

Revoke and rotate this secret.

If possible, rewrite git history. Rewriting git history is not a trivial act. You might completely break other contributing developers' workflow and you risk accidentally deleting legitimate data.

To avoid such incidents in the future consider

following these best practices for managing and storing secrets including API keys and other credentials

install secret detection on pre-commit to catch secret before it leaves your machine and ease remediation.

🦉 GitGuardian detects secrets in your source code to help developers and security teams secure the modern development process. You are seeing this because you or someone else with access to this repository has authorized GitGuardian to scan your pull request.

Not a real secret (:

- Add pytest-benchmark~=4.0 dev dependency - Add ContextVar-based timing collector (timing.py) for zero-cost production instrumentation via _phase() context managers - Instrument extract_and_format_get_annotation_result() with per-phase timing: format_s, agency_suggestions_s, location_suggestions_s, name_suggestions_s, batch_info_s - Instrument GetNextURLForAllAnnotationQueryBuilder.run() with main_query_s timing - Add benchmark test suite under tests/automated/integration/benchmark/ with HTTP round-trip and per-phase breakdown tests - Add GHA workflow (.github/workflows/benchmark.yml) that runs on workflow_dispatch or PR to dev, uploads JSON artifact per commit SHA - Add README with Mermaid sequence diagram of the measured call chain

…566) Add scale_seed.py with create_scale_data() that seeds 10k URLs plus geographic hierarchy, agencies, name/agency/location suggestions to exercise the query planner under realistic load. Add scale_seeder fixture (module-scoped) and two new benchmark tests that mirror the existing small-data pair, printing per-phase averages for direct comparison.

) The sequential agency_auto_suggestions / add_location_suggestion loops (4 DB calls × 5k iterations each) blew through the 300s pytest timeout during fixture setup. Replace with 3 round-trips per suggestion type: initiate_task → bulk_insert subtasks (return_ids) → bulk_insert suggestions.

SuggestionModel.robo_confidence is typed int; Pydantic rejects float values with fractional parts (e.g. 0.9). Use 1.0 which coerces cleanly.

…ags (#566) Convert both annotation views from regular views to MATERIALIZED VIEWs with unique indexes on url_id. Add CONCURRENTLY refresh calls to refresh_materialized_views(). Drops main_query_s from ~177ms to ~0.64ms at 10k-URL scale (~276x improvement).

…test (#566) With materialized views, URLs added between refreshes have no row in the view. Inner joins excluded them entirely, making new URLs invisible to annotators. Switch to LEFT OUTER JOINs so URLs without a view row still appear with NULL/0 counts. The sort test explicitly verifies annotation-count-driven ordering, so it needs a manual refresh_materialized_views() call after data setup to reflect counts before querying.

- Migration: Union[str, None] -> Optional[str], add docstrings to upgrade/downgrade - async_.py: add missing newline at end of file

… heads (#566) Upstream dev merged 1fb2286a016c (add_internet_archive_to_batch_strategy) which also chains from 759ce7d0772b, creating two Alembic heads. Update our migration's down_revision to sit after upstream's.

…credentials

…le_seed (#566)

…k_summary.py (#566)

…o docker-compose GitGuardian flagged the literal default password added to docs/development.md. Replaced with a pointer to local_database/docker-compose.yml where it is already defined.

maxachis

So my impression of this is that the core idea and the solution is good! Great, even! I think benchmarking is something we lack in this repo, and it's forward-thinking to include it. And the proposed solution makes perfect sense -- those views definitely justify being materialized. I think this will definitely improve performance.

But the PR is also introducing new design principles (primarily the adjustment of code to be amenable to benchmarking) that I want to interrogate before committing. I'm not averse to introducing such principles if they're justified, but I do want to ensure they are justified, because that will be the pattern we follow when benchmarking other parts of this app and data-sources.

A secondary issue is that it's touching a few too many files that don't need touched for this PR. I wouldn't block on that, but I am noting it.

maxachis · 2026-03-02T01:24:12Z

tests/automated/integration/benchmark/test_annotate_all_benchmark.py

+from tests.automated.integration.benchmark.scale_seed import ScaleSeedResult
+from tests.automated.integration.readonly.helper import ReadOnlyTestHelper
+
+_PHASE_ATTRS = (


I do think this logic is a little difficult to parse. The questions that come to mind are:

What is a "phase" in this context?

Why are they written to JSON?

The answers are in the code, but I think it would benefit to have a bit of documentation explaining the workflow, because it doesn't seem immediately apparent.

We also create an environment variable that is used once and not documented at root level (we have an ENV.md file for this purpose). So I want to interrogate whether we should have it as an env , which I leave to your discretion, as I don't have a strong opinion on it other than wanting to interrogate it. If we do include it, make sure it's documented at root level (and perhaps note this in CLAUDE.md as part of the general style guide).

I added some more info to what a phase is in the README.md accompanying the benchmark test.

Regarding the ENV.md file, I didn't want to update that since it seems to only describe things about the runtime environment, whereas this is specific to a test harness. Happy to add if you still think it should go there 👍

maxachis · 2026-03-02T01:33:08Z

.github/workflows/benchmark.yml

+on:
+  workflow_dispatch:
+  pull_request:
+    branches: [dev]


I like the idea of having a benchmark that we can repeatedly review! But I'm not convinced we should have it run on every PR to dev -- main perhaps (which is not listed here), but I want to be sensitive to GitHub action usage limits. We may be nowhere near those limits (I'll ask @josh-chamberlain for deets on that), but it's something I want to flag for closer inspection.

The other thing is that I am assuming that relatively few code changes will impact annotation benchmarking, and in that context it may not be efficient to have a GitHub workflow solely dedicated to one (albeit important) part of the app that runs on every code change, because most of the time it will function just fine. The lint and test actions, for example, are always relevant; even if most of the code they test is unchanged, it can still fail on conceivably any change made. But that's not as probable with this action.

That might be an argument for a wider set of benchmarks, which would be a separate issue.

My first instinct is I like the benchmark, and I'm open to the idea of having GitHub actions for a more broad set of benchmarks. But I want to keep scope relatively narrow in this PR. I would either cut it for now or set it to workflow dispatch only.

Moved to workflow_dispatch only.

My humble opinion is that a manual check is just a check you never run, but I understand the concern.

Excellent! I think we can justify having it be there on dev -> main (and perhaps having it alert if there's a noticeable performance decrease between that an an artifact for the current version. If it does prove to be worth having on every dev PR as well, happy to add that. As is, this is a tremendous value add.

maxachis · 2026-03-02T01:34:54Z

alembic/versions/2026_02_15_1257-1fb2286a016c_add_internet_archive_to_batch_strategy_.py

 # revision identifiers, used by Alembic.
 revision: str = '1fb2286a016c'
-down_revision: Union[str, None] = '759ce7d0772b'
+down_revision: Optional[str] = '759ce7d0772b'


I've no objections to this in principle, but do keep an eye on keeping scope focused to what's relevant to this issue! We're affecting a number of files as-is, and tighter PRs make things simpler to review.

maxachis · 2026-03-02T01:38:27Z

src/api/endpoints/annotate/_shared/queries/helper.py

            URLAnnotationFlagsView,
-            URLAnnotationFlagsView.url_id == URL.id
+            URLAnnotationFlagsView.url_id == URL.id,
+            isouter=True


nit: I'm trying to think of a scenario where isouter would be necessary here. I don't think it will have a negative impact, since there are no duplicates in these views (at least as far as I know). But it's also not clear that it makes a meaningful difference.

I suppose, since a mat view will be on a lag, this would allow the query to pick up URLs that haven't yet been added to the mat views. But then we can't do much with those URLs because we don't have the attributes contained in those mat views!

Open to being proven wrong, but otherwise I would nix.

Yup, removed

maxachis · 2026-03-02T01:39:09Z

src/api/endpoints/annotate/_shared/extract.py

-            url.user_url_type_suggestions,
-            url.anon_url_type_suggestions
+    """Extract and format the annotation result for a URL."""
+    with _phase("format_s"):


nit: Why the _s underscore for these? If that's an established term, then mea culpa, but still need my peasant brain educated 🧠.

Common unit suffix for "seconds" for future reference; but I've changed this 👍

maxachis · 2026-03-02T01:45:56Z

src/api/endpoints/annotate/_shared/extract.py

    # Agencies
-    agency_suggestions: AgencyAnnotationResponseOuterInfo = \
-        await GetAgencySuggestionsQueryBuilder(url_id=url.id).run(session)
+    with _phase("agency_suggestions_s"):


I'm not deeply experienced in benchmarking, but I'm hesitant to adjust the code primarily to make it more amenable to benchmarking. In my (limited) experience, there are benchmarking libraries that can identify the biggest contributors without having to resort to these sorts of adjustments.

Again, not my area of knowledge, If there's a compelling reason for doing it like this, I'll approve and add it to a general architecture guide! But let's chat about it.

Understood. I think there is a use to having telemetry like this but I think it is perhaps out of scope of the issue. I've changed this to just profile at the top level. Results reveal the same improved performance (in updated PR body).

I'm interested in having that conversation! I'm all about making overall improvements.

- docs(benchmark): define 'phase' clearly in README - ci(benchmark): restrict benchmark workflow to workflow_dispatch only - fix(migration): restore internet_archive migration to match dev (style-only drift from lint passes) - fix(annotate): remove isouter from mat view joins; inner join is correct since missing rows are unusable

…#566)

)

josh-chamberlain

I appreciate the performance updates! This makes a big difference for labelers. Sticking my oar in with a comment about the broader structure:

e.g. a URL that just received its 10th annotation may still be ranked highly when it should be deprioritized. My personal viewpoint is that this is acceptable because:

No user-facing data is incorrect - only the ordering of work is affected

No writes or authorization decisions are gated on these values

Annotators are directed to URLs in a slightly suboptimal order, not a wrong one

Follows the same trade-off already accepted by the 3 existing materialized views in the codebase
If fresher data is a product priority, I would recommend a more aggressive refresh rate (such as hourly) for the materialized view.

(emphasis mine) I wasn't aware it worked like this!

Your thinking is sound, but if we experience a spike in labelers or do a labeling event (not unheard of) this could be pretty frustrating. If, say, 20 people were labeling, they would all be duplicating each other's efforts. Same goes for 100. Once a URL has reached the agreement threshold, it should be removed quickly from the annotation pool—ideally as soon as it reaches its threshold. Even if we lowered the refresh rate to an hour, we'd have pretty much the same problem: however many concurrent labelers we have, that's how many people would end up labeling a given URL.

Granted these are not problems we have at the moment, but labeling events are something we have done before and would like to do again, so I think it's worth making sure that's not an issue.

Am I thinking about this correctly / making sense?

maxachis · 2026-03-09T20:56:12Z

Am I thinking about this correctly / making sense?

@josh-chamberlain I think you're right that it's a risk, but my opinion is that if we have so many people annotating at once that they get duplicate data, that's good problem to have, compared to the delayed responses.

We could have a system where we trigger adjustments to the order once something gets approved, but I think that's out of scope for this bad boy.

…otation-load-time Resolve uv.lock conflict by regenerating lockfile. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

maxachis

I approve! I'll look into creating an issue to handle stale annotation data.

Three migrations branched from 1fb2286a016c after merging dev. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

labradorite-dev requested a review from josh-chamberlain as a code owner February 28, 2026 22:19

labradorite-dev added 18 commits February 28, 2026 14:29

fix(benchmark): use confidence=1.0 for location suggestions (#566)

e9db7d3

SuggestionModel.robo_confidence is typed int; Pydantic rejects float values with fractional parts (e.g. 0.9). Use 1.0 which coerces cleanly.

chore(deps): upgrade pytest-benchmark from 4.0 to 5.2.3

2c414e5

style: fix lint issues in our changes (#566)

90118a0

- Migration: Union[str, None] -> Optional[str], add docstrings to upgrade/downgrade - async_.py: add missing newline at end of file

ci(benchmark): post results to GHA job summary (#566)

f826efd

ci(benchmark): capture per-phase timings in GHA job summary (#566)

d0861d0

fix(benchmark): use removesuffix to avoid mangling phase labels (#566)

864a8bd

docs(benchmark): simplify README; remove stale diagram and hardcoded …

72a1810

…credentials

style: fix flake8 warnings in benchmark test and timing module (#566)

ed5a7e5

style: add missing docstrings to benchmark package, conftest, and sca…

9aae5a4

…le_seed (#566)

style: fix flake8 annotations from PR check (#566)

00f24ec

refactor(benchmark): extract summary heredoc to scripts/post_benchmar…

35248e0

…k_summary.py (#566)

docs(development): replace hardcoded dev db password with reference t…

80b77b8

…o docker-compose GitGuardian flagged the literal default password added to docs/development.md. Replaced with a pointer to local_database/docker-compose.yml where it is already defined.

labradorite-dev force-pushed the issue-566-optimize-annotation-load-time branch from e5dc903 to 80b77b8 Compare February 28, 2026 22:30

maxachis requested changes Mar 2, 2026

View reviewed changes

labradorite-dev added 3 commits March 8, 2026 09:38

refactor(benchmark): replace _phase timing with pyinstrument profiler (…

4f37aed

…#566)

fix(tests): refresh materialized views in annotation setup helpers (#566

3e710ea

)

labradorite-dev requested a review from maxachis March 9, 2026 00:40

josh-chamberlain reviewed Mar 9, 2026

View reviewed changes

Merge remote-tracking branch 'origin/dev' into issue-566-optimize-ann…

d060b20

…otation-load-time Resolve uv.lock conflict by regenerating lockfile. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Add PR to main trigger for benchmark workflow

c3c360e

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

maxachis approved these changes Mar 9, 2026

View reviewed changes

Add Alembic merge migration to resolve multiple heads

9e5c124

Three migrations branched from 1fb2286a016c after merging dev. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

maxachis merged commit ee09ea1 into dev Mar 9, 2026
3 of 4 checks passed

maxachis deleted the issue-566-optimize-annotation-load-time branch March 9, 2026 21:12

This was referenced Mar 9, 2026

Implement Annotation Queue (CQRS) #602

Open

Dev to Main — March 2026 #603

Open

josh-chamberlain mentioned this pull request Mar 10, 2026

GET annotations: mitigate volunteer pileup on the first annotation tasks #604

Open

2 tasks

Conversation

labradorite-dev commented Feb 28, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Background Context

Performance Improvement

Staleness Tradeoff

Alternative Long Term Solution

Test plan

Uh oh!

gitguardian bot commented Feb 28, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

⚠️ GitGuardian has uncovered 1 secret following the scan of your pull request.

Uh oh!

labradorite-dev commented Feb 28, 2026

⚠️ GitGuardian has uncovered 1 secret following the scan of your pull request.

Uh oh!

maxachis left a comment • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

maxachis Mar 2, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

maxachis Mar 2, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

josh-chamberlain left a comment

Choose a reason for hiding this comment

Uh oh!

maxachis commented Mar 9, 2026

Uh oh!

maxachis left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

labradorite-dev commented Feb 28, 2026 •

edited

Loading

gitguardian bot commented Feb 28, 2026 •

edited

Loading

maxachis left a comment •

edited

Loading

maxachis Mar 2, 2026 •

edited

Loading

maxachis Mar 2, 2026 •

edited

Loading