feat: Internet Archive Collector #585
labradorite-dev wants to merge 6 commits into Police-Data-Accessibility-Project:dev
Conversation
maxachis
left a comment
This is looking good! Just three small things remain:
- Make sure it passes flake8 linting.
- Point the merge at dev, rather than main.
- When you run the manual test, provide around 10 of the URLs it gathers.
Do all that, and this will be good to merge!
```python
def upgrade() -> None:
    switch_enum_type(
```
Good call on this! It also reminded me that having an enum as a column is an antipattern, which I've noted for addressing here.
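For context on the migration under discussion: appending a value to a Postgres enum can be done with a single `ALTER TYPE` statement. The sketch below is illustrative only, assuming a Postgres enum type named `batch_strategy`; the `add_enum_value_sql` helper is hypothetical and is not the repo's `switch_enum_type`.

```python
def add_enum_value_sql(enum_name: str, value: str) -> str:
    """Build the DDL for appending a value to a Postgres enum type.

    IF NOT EXISTS makes re-running the migration a no-op. Note that a
    newly added enum value cannot be used inside the same transaction
    that adds it (a Postgres restriction).
    """
    return f"ALTER TYPE {enum_name} ADD VALUE IF NOT EXISTS '{value}'"

# Inside an Alembic migration this would be executed roughly as:
#   def upgrade() -> None:
#       op.execute(add_enum_value_sql("batch_strategy", "internet_archive"))
#
# Postgres cannot drop a single enum value, which is why reverting
# requires rebuilding the whole type (presumably what a helper like
# switch_enum_type does).
```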
**Manual Test Results — Sample URLs**

Ran the Internet Archive crawler against a couple of seed URLs. Sample results for the `cityofchicago.org` domain:
| # | URL |
|---|---|
| 1 | http://www.cityofchicago.org:80/ |
| 2 | http://www.cityofchicago.org:80/AboutTown.html |
| 3 | http://www.cityofchicago.org:80/AdminHearings/ |
| 4 | http://www.cityofchicago.org:80/AdminHearings/About.html |
| 5 | http://www.cityofchicago.org:80/AdminHearings/Contacts.html |
| 6 | http://www.cityofchicago.org:80/AdminHearings/Division.html |
| 7 | http://www.cityofchicago.org:80/AdminHearings/FAQ.html |
| 8 | http://www.cityofchicago.org:80/AdminHearings/Hearing.html |
| 9 | http://cityofchicago.org:80/AdminHearings/AdminLawOfficers.html |
| 10 | http://www.cityofchicago.org:80/AdminHearings/AHPubServIntern.html |
May I also suggest adding a linting hook for this?
This is an excellent idea! We may be interested in using prek instead, since it's built in Rust and may be faster, but regardless, let's look into it more deeply!

Also, it appears we have a test failing! Resolve that, and we should be good!

@josh-chamberlain, can we add @labradorite-dev as a contributor? That way he can have these runs occur automatically. I will happily vouch for him.
Implement a new collector that uses the Internet Archive CDX API to discover archived URLs on domains PDAP already knows about. Users provide seed URLs, domains are extracted, and the Wayback Machine is searched for all archived pages, with filtering by MIME type and URL pattern, plus deduplication.
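A minimal sketch of that flow: the endpoint and query parameters below are the public CDX server's documented ones, but the function names (`build_cdx_query`, `parse_cdx_response`) and the exact field list are illustrative assumptions, not the PR's actual code.

```python
from urllib.parse import urlencode, urlparse

CDX_ENDPOINT = "https://web.archive.org/cdx/search/cdx"

def build_cdx_query(seed_url: str, mime_allowlist=("text/html",), limit=1000) -> str:
    """Build a CDX API URL that lists archived pages under the seed's domain."""
    domain = urlparse(seed_url).netloc  # naive domain extraction from the seed
    params = [
        ("url", f"{domain}/*"),          # prefix match: all pages under the domain
        ("output", "json"),              # JSON array-of-arrays response
        ("collapse", "urlkey"),          # server-side dedup of repeat snapshots
        ("fl", "original,mimetype"),     # only the fields the collector needs
        ("limit", str(limit)),
    ]
    for mime in mime_allowlist:
        params.append(("filter", f"mimetype:{mime}"))
    return f"{CDX_ENDPOINT}?{urlencode(params)}"

def parse_cdx_response(rows: list) -> list:
    """Extract unique original URLs; the first row is the header."""
    if not rows:
        return []
    header, records = rows[0], rows[1:]
    idx = header.index("original")
    seen, urls = set(), []
    for rec in records:
        url = rec[idx]
        if url not in seen:   # client-side dedup, in case collapse misses any
            seen.add(url)
            urls.append(url)
    return urls
```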
…e crawler

Add mocked integration tests (happy path, empty domain, API error) and a manual lifecycle test hitting the live CDX API. Also fix the missing 'internet_archive' value in the batch_strategy DB enum and the SQLAlchemy model.
The mime_type_allowlist already filters out non-HTML content, making the static asset file extension patterns unnecessary.
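As a tiny illustration of that point (the allowlist contents and function name here are assumptions, not the PR's code): a MIME-type allowlist already rejects CSS, JavaScript, and image responses, so extension-based URL patterns for static assets would be redundant.

```python
MIME_ALLOWLIST = {"text/html"}  # assumed allowlist contents

def keep_record(mimetype: str) -> bool:
    # Static assets report text/css, application/javascript, image/png,
    # and so on, so checking the MIME type alone filters them out; no
    # *.css / *.js / *.png URL patterns are needed.
    return mimetype in MIME_ALLOWLIST
```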
Add missing module, class, and method docstrings (D100-D107) and type annotations (ANN101, ANN001, ANN201, ANN204) to all Internet Archive collector files to satisfy flake8 linting requirements.
Force-pushed from b472c26 to a619038.
Update alembic migration down_revision to chain off latest dev head and fix renamed get_access_info -> get_admin_access_info in IA route.
Fixed the issues. Looks like I should have branched off of dev!
@labradorite-dev invited as a contributor.
Summary
Resolves #379 by creating an internet archive collector. Tested via unit tests and manual test script.
I tried my best to follow repo standards, but if there's anything I can do better, please let me know!