Add option to write JSONL file with data on skipped pages#966
Merged
25a80f2 to
4f78dde
Compare
Member
Author
Moving back to draft while I work on adding the new |
220b267 to
9897393
Compare
tests: update eslint to ignore promise check on tests
f5331ec to
6ae4987
Compare
ikreymer
reviewed
Apr 3, 2026
Member
Re: naming, perhaps we could go with |
url and ts are the only two required fields for lines in a pages JSONL file according to our WACZ spec, and it may be useful to know when a page was encountered, so this commit ensures that the ts is properly written into the file for each line.
Member
Author
I like the suggestion of |
For some reason, the datapackage validator is complaining about the JSONL file in the reports directory, though it is fine with JSONL files in the pages directory. Commenting out this particular test for now.
record 'redirectOutOfScope' type to indicate pages that are excluded because they redirect to out-of-scope pages
b5bb316 to
9b1fef6
Compare
…ipped, if redirect is explicitly excluded add test for redirectToExcluded
ikreymer
approved these changes
Apr 9, 2026
Member
ikreymer
left a comment
Made a few more changes:
- added test for pageLimit skipped page
- added redirectToExcluded skip reason and a test for it
Should be good to go now!
Fixes #965

Add --reportSkipped argument, which will enable the creation of a reports/skippedPages.jsonl file with the following elements for each URL encountered that was not queued:
- url
- seedUrl
- depth
- reason (one of outOfScope, pageLimit, robotsTxt, or redirectToExcluded)
- ts

The reports/ directory is new and will likely be expanded with other crawl-time reporting moving forward.
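
To make the record shape concrete, here is a minimal sketch of what one line of reports/skippedPages.jsonl might look like and how the file could be tallied by skip reason. The field names and reason values are taken from the description above. The TypeScript types, the ISO 8601 format for ts, and the helper names `toJsonlLine` and `countByReason` are assumptions for illustration, not the crawler's actual code.

```typescript
// Hypothetical model of one line in reports/skippedPages.jsonl.
// Field names and reason values come from the PR description;
// the types below are assumptions.
type SkipReason =
  | "outOfScope"
  | "pageLimit"
  | "robotsTxt"
  | "redirectToExcluded";

interface SkippedPage {
  url: string;
  seedUrl: string;
  depth: number;
  reason: SkipReason;
  ts: string; // assumed to be an ISO 8601 timestamp
}

// Serialize one record as a JSONL line: a single JSON object
// followed by a newline.
function toJsonlLine(page: SkippedPage): string {
  return JSON.stringify(page) + "\n";
}

// Count skipped pages per reason, given the text of a JSONL file.
function countByReason(jsonl: string): Map<string, number> {
  const counts = new Map<string, number>();
  for (const line of jsonl.split("\n")) {
    if (!line.trim()) continue; // ignore blank lines
    const page = JSON.parse(line) as SkippedPage;
    counts.set(page.reason, (counts.get(page.reason) ?? 0) + 1);
  }
  return counts;
}
```

In practice the string passed to `countByReason` would be the contents of reports/skippedPages.jsonl from wherever the crawl writes its output; a quick per-reason tally like this is one way to check whether scope rules or the page limit account for most skipped URLs.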