feat: Internet Archive Collector #585
labradorite-dev wants to merge 6 commits into Police-Data-Accessibility-Project:dev
Conversation
maxachis
left a comment
This is looking good! Just three small things remain:
- Make sure it passes flake8 linting.
- Point the merge at dev, rather than main.
- When you run the manual test, provide around 10 of the URLs it gathers.
Do all that, and this will be good to merge!
```python
def upgrade() -> None:
    switch_enum_type(
```
Good call on this! It also reminded me that having an enum as a column is an antipattern, which I've noted for addressing here.
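For context on the migration under discussion: appending a value to a Postgres enum can be done with a single `ALTER TYPE` statement. The sketch below is illustrative only, assuming a Postgres enum type named `batch_strategy`; the `add_enum_value_sql` helper is hypothetical and is not the repo's `switch_enum_type`.

```python
def add_enum_value_sql(enum_name: str, value: str) -> str:
    """Build the DDL for appending a value to a Postgres enum type.

    IF NOT EXISTS makes re-running the migration a no-op. Note that a
    newly added enum value cannot be used inside the same transaction
    that adds it (a Postgres restriction).
    """
    return f"ALTER TYPE {enum_name} ADD VALUE IF NOT EXISTS '{value}'"

# Inside an Alembic migration this would be executed roughly as:
#   def upgrade() -> None:
#       op.execute(add_enum_value_sql("batch_strategy", "internet_archive"))
#
# Postgres cannot drop a single enum value, which is why reverting
# requires rebuilding the whole type (presumably what a helper like
# switch_enum_type does).
```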
**Manual Test Results — Sample URLs**

Ran the Internet Archive crawler against a couple of seed URLs. Sample results for the `cityofchicago.org` domain:
| # | URL |
|---|---|
| 1 | http://www.cityofchicago.org:80/ |
| 2 | http://www.cityofchicago.org:80/AboutTown.html |
| 3 | http://www.cityofchicago.org:80/AdminHearings/ |
| 4 | http://www.cityofchicago.org:80/AdminHearings/About.html |
| 5 | http://www.cityofchicago.org:80/AdminHearings/Contacts.html |
| 6 | http://www.cityofchicago.org:80/AdminHearings/Division.html |
| 7 | http://www.cityofchicago.org:80/AdminHearings/FAQ.html |
| 8 | http://www.cityofchicago.org:80/AdminHearings/Hearing.html |
| 9 | http://cityofchicago.org:80/AdminHearings/AdminLawOfficers.html |
| 10 | http://www.cityofchicago.org:80/AdminHearings/AHPubServIntern.html |
May I also suggest adding a linting hook for this?
This is an excellent idea! We may be interested in using prek instead, since it's built in Rust and may be faster, but regardless, let's look into it more deeply!

Also, it appears we have a test failing! Resolve that, and we should be good!

@josh-chamberlain, can we add @labradorite-dev as a contributor? That way he can have these runs occur automatically. I will happily vouch for him.
Implement a new collector that uses the Internet Archive CDX API to discover archived URLs on domains PDAP already knows about. Users provide seed URLs, domains are extracted, and the Wayback Machine is searched for all archived pages, with filtering by MIME type and URL pattern, plus deduplication.
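A minimal sketch of that flow: the endpoint and query parameters below are the public CDX server's documented ones, but the function names (`build_cdx_query`, `parse_cdx_response`) and the exact field list are illustrative assumptions, not the PR's actual code.

```python
from urllib.parse import urlencode, urlparse

CDX_ENDPOINT = "https://web.archive.org/cdx/search/cdx"

def build_cdx_query(seed_url: str, mime_allowlist=("text/html",), limit=1000) -> str:
    """Build a CDX API URL that lists archived pages under the seed's domain."""
    domain = urlparse(seed_url).netloc  # naive domain extraction from the seed
    params = [
        ("url", f"{domain}/*"),          # prefix match: all pages under the domain
        ("output", "json"),              # JSON array-of-arrays response
        ("collapse", "urlkey"),          # server-side dedup of repeat snapshots
        ("fl", "original,mimetype"),     # only the fields the collector needs
        ("limit", str(limit)),
    ]
    for mime in mime_allowlist:
        params.append(("filter", f"mimetype:{mime}"))
    return f"{CDX_ENDPOINT}?{urlencode(params)}"

def parse_cdx_response(rows: list) -> list:
    """Extract unique original URLs; the first row is the header."""
    if not rows:
        return []
    header, records = rows[0], rows[1:]
    idx = header.index("original")
    seen, urls = set(), []
    for rec in records:
        url = rec[idx]
        if url not in seen:   # client-side dedup, in case collapse misses any
            seen.add(url)
            urls.append(url)
    return urls
```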
…e crawler

Add mocked integration tests (happy path, empty domain, API error) and a manual lifecycle test hitting the live CDX API. Also fix the missing 'internet_archive' value in the batch_strategy DB enum and the SQLAlchemy model.
The mime_type_allowlist already filters out non-HTML content, making the static asset file extension patterns unnecessary.
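As a tiny illustration of that point (the allowlist contents and function name here are assumptions, not the PR's code): a MIME-type allowlist already rejects CSS, JavaScript, and image responses, so extension-based URL patterns for static assets would be redundant.

```python
MIME_ALLOWLIST = {"text/html"}  # assumed allowlist contents

def keep_record(mimetype: str) -> bool:
    # Static assets report text/css, application/javascript, image/png,
    # and so on, so checking the MIME type alone filters them out; no
    # *.css / *.js / *.png URL patterns are needed.
    return mimetype in MIME_ALLOWLIST
```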
Add missing module, class, and method docstrings (D100-D107) and type annotations (ANN101, ANN001, ANN201, ANN204) to all Internet Archive collector files to satisfy flake8 linting requirements.
Force-pushed from b472c26 to a619038.
Update alembic migration down_revision to chain off latest dev head and fix renamed get_access_info -> get_admin_access_info in IA route.
Fixed the issues. Looks like I should have branched off of dev!
@labradorite-dev invited as a contributor.
Summary
Resolves #379 by creating an internet archive collector. Tested via unit tests and manual test script.
I tried my best to follow repo standards, but if there's anything I can do better, please let me know!