
Add option to write JSONL file with data on skipped pages#966

Merged
ikreymer merged 25 commits into main from issue-965-urls-not-queued-list
Apr 9, 2026


@tw4l
Member

@tw4l tw4l commented Feb 5, 2026

Fixes #965

Add a --reportSkipped argument that enables writing a reports/skippedPages.jsonl file, with the following fields for each URL that was encountered but not queued:

  • url
  • seedUrl
  • depth
  • reason (one of outOfScope, pageLimit, robotsTxt, or redirectToExcluded)
  • ts

The reports/ directory is new and will likely be expanded with other crawl-time reporting moving forward.
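
For anyone consuming the new file, here is a minimal sketch of reading it. The field names and reason values are taken from the description above; the concrete types (string vs. number, `ts` as an ISO 8601 string), the helper names `parseSkippedPages` and `countByReason`, and the sample lines are assumptions for illustration, not the crawler's actual code or output.

```typescript
// Field names and reason values come from the PR description;
// the types below are assumptions, not taken from the crawler source.
type SkipReason =
  | "outOfScope"
  | "pageLimit"
  | "robotsTxt"
  | "redirectToExcluded";

interface SkippedPage {
  url: string;
  seedUrl: string;
  depth: number;
  reason: SkipReason;
  ts: string; // assumed to be an ISO 8601 timestamp string
}

// Parse JSONL text: one JSON object per non-empty line.
function parseSkippedPages(jsonl: string): SkippedPage[] {
  return jsonl
    .split("\n")
    .map((line) => line.trim())
    .filter((line) => line.length > 0)
    .map((line) => JSON.parse(line) as SkippedPage);
}

// Count how many pages were skipped for each reason.
function countByReason(pages: SkippedPage[]): Record<string, number> {
  const counts: Record<string, number> = {};
  for (const page of pages) {
    counts[page.reason] = (counts[page.reason] ?? 0) + 1;
  }
  return counts;
}

// Hypothetical sample lines, made up for this example.
const sampleJsonl = [
  '{"url":"https://example.org/a","seedUrl":"https://example.com/","depth":1,"reason":"outOfScope","ts":"2026-04-08T15:19:00.000Z"}',
  '{"url":"https://example.com/b","seedUrl":"https://example.com/","depth":2,"reason":"pageLimit","ts":"2026-04-08T15:19:01.000Z"}',
  '{"url":"https://example.net/c","seedUrl":"https://example.com/","depth":1,"reason":"outOfScope","ts":"2026-04-08T15:19:02.000Z"}',
].join("\n");

const skippedPages = parseSkippedPages(sampleJsonl);
console.log(countByReason(skippedPages));
```

In practice the input would be the contents of reports/skippedPages.jsonl read from the crawl output; the sample string only keeps the sketch self-contained.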

@tw4l tw4l changed the title from "Add option to write page JSONL file with all pages not queued" to "Add option to write JSONL file with data on URLs not queued" Feb 5, 2026
@tw4l tw4l force-pushed the issue-965-urls-not-queued-list branch from 25a80f2 to 4f78dde Compare February 10, 2026 21:19
@tw4l tw4l marked this pull request as ready for review February 10, 2026 22:52
@tw4l tw4l requested a review from ikreymer February 10, 2026 22:52
@tw4l tw4l marked this pull request as draft February 11, 2026 15:11
@tw4l
Member Author

tw4l commented Feb 11, 2026

Moving back to draft while I work on adding the new reports dir to the WACZ

@tw4l tw4l force-pushed the issue-965-urls-not-queued-list branch from 220b267 to 9897393 Compare February 11, 2026 18:33
@tw4l tw4l marked this pull request as ready for review February 17, 2026 15:05
@ikreymer ikreymer force-pushed the issue-965-urls-not-queued-list branch from f5331ec to 6ae4987 Compare April 3, 2026 21:14
Comment thread tests/skipped_pages.test.ts
@ikreymer
Member

ikreymer commented Apr 3, 2026

Re: naming, perhaps we could go with skippedPages.jsonl and the option --reportSkipped?
That way, any other optional reports added to the reports dir later could also be enabled with flags that start with --report<name>

tw4l added 2 commits April 8, 2026 11:19
url and ts are the only two required fields for lines in a pages
JSONL file according to our WACZ spec, and it may be useful to
know when a page was encountered, so this commit ensures that the
ts is properly written into the file for each line.
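
The requirement described in this commit message can be sketched as a per-line check. This is an illustration only: `hasRequiredFields` is a hypothetical helper, and treating both `url` and `ts` as strings is an assumption rather than something stated in this PR.

```typescript
// Hypothetical helper: returns true if a JSONL line parses as a JSON
// object that carries both required fields, url and ts.
// Assumes both values are serialized as strings.
function hasRequiredFields(line: string): boolean {
  let entry: unknown;
  try {
    entry = JSON.parse(line);
  } catch {
    return false; // not valid JSON
  }
  if (typeof entry !== "object" || entry === null) {
    return false; // valid JSON, but not an object
  }
  const record = entry as Record<string, unknown>;
  return typeof record.url === "string" && typeof record.ts === "string";
}

// Made-up example lines.
const completeLine = '{"url":"https://example.com/","ts":"2026-04-08T15:19:00.000Z"}';
const lineWithoutTs = '{"url":"https://example.com/"}';

console.log(hasRequiredFields(completeLine));
console.log(hasRequiredFields(lineWithoutTs));
```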
@tw4l tw4l requested a review from ikreymer April 8, 2026 15:37
@tw4l
Member Author

tw4l commented Apr 8, 2026

I like the suggestion of skippedPages.jsonl and --reportSkipped, that's great, thanks. I've implemented the changes now if you want to take a second look.

tw4l and others added 5 commits April 8, 2026 12:23
For some reason, the datapackage validator is complaining about
the JSONL file in the reports directory, though it is fine with
JSONL files in the pages directory. Commenting out this particular
test for now.
record 'redirectOutOfScope' type to indicate pages that are excluded because they redirect to out-of-scope pages
@ikreymer ikreymer force-pushed the issue-965-urls-not-queued-list branch from b5bb316 to 9b1fef6 Compare April 9, 2026 07:27
@tw4l tw4l changed the title from "Add option to write JSONL file with data on URLs not queued" to "Add option to write JSONL file with data on skipped pages" Apr 9, 2026
ikreymer added 2 commits April 9, 2026 11:27
…ipped, if redirect is explicitly excluded

add test for redirectToExcluded
Member

@ikreymer ikreymer left a comment


Made a few more changes:

  • added a test for a pageLimit skipped page
  • added the redirectToExcluded skip reason and a test for it

Should be good to go now!

@ikreymer ikreymer merged commit 1c6e814 into main Apr 9, 2026
6 checks passed
@ikreymer ikreymer deleted the issue-965-urls-not-queued-list branch April 9, 2026 19:51

Development

Successfully merging this pull request may close these issues.

Add option to produce report of skipped pages

2 participants