New Feature: Add configurable HTML hook before PDF generation#128
New Feature: Add configurable HTML hook before PDF generation#128jdillard wants to merge 12 commits intouseblocks:mainfrom
Conversation
patdhlk
left a comment
There was a problem hiding this comment.
Thanks for your work 🙏 The design is sound. The _toctree_fix refactor to pass soup through a pipeline is a good cleanup. The hook contract is simple and useful.
Tighten up the error handling and this is ready.
To summarize:
- Guard against
spec_from_file_locationreturningNone - Load the hook once in
__init__, not on every call to_execute_html_hook - Make the
None returnan error, not a warning — forgetting return shouldn't silently no-op - Drop the version bump from this PR
- Add a note that it's intentionally single-hook
| spec = importlib.util.spec_from_file_location("simplepdf_hook", script_path) | ||
| module = importlib.util.module_from_spec(spec) | ||
| try: | ||
| spec.loader.exec_module(module) |
There was a problem hiding this comment.
You're loading and executing an arbitrary Python file from a path in the config. This is fine — conf.py is already arbitrary Python, so you're not expanding the trust boundary. But there's a subtlety: you hardcode the module name to "simplepdf_hook". If someone has two different SimplePDF projects and Python somehow shares the process (think tox, or a monorepo build script), you'll shadow one with the other in sys.modules. You don't insert into sys.modules explicitly, so it's probably fine, but spec_from_file_location can be surprising here. Worth a comment, or use a unique name like f"simplepdf_hook_{id(self)}".
More importantly: spec_from_file_location can return None if the path is weird (e.g., a directory, a .pyc without source). You immediately do spec.loader.exec_module(module) — that's an AttributeError: 'NoneType' has no attribute 'loader' if spec is None. Check for it:
spec = importlib.util.spec_from_file_location("simplepdf_hook", script_path)
if spec is None or spec.loader is None:
raise ConfigError(f"Cannot load module from: {script_path}")
| if result is None: | ||
| logger.warning( | ||
| "simplepdf_html_hook returned None, using original HTML. " | ||
| "The hook should return a BeautifulSoup object." | ||
| ) | ||
| return soup |
There was a problem hiding this comment.
This is too forgiving. The most common reason html_hook returns None is the user forgot the return statement. Silently using the original HTML means their hook did nothing and they'll spend 20 minutes wondering why. Make this an error, not a warning. The user will thank you.
If you really want to be lenient, at least make the warning loud enough that it won't get buried:
logger.warning(
"simplepdf_html_hook returned None — did you forget 'return soup'? "
"Falling back to unmodified HTML.",
type="simplepdf", subtype="hook",
)
But I'd just raise.
There was a problem hiding this comment.
No tests. The project has no test harness, I know. But _load_html_hook is testable in isolation — give it a temp file with a function, assert it loads. Give it a file without html_hook, assert it raises. Give it a nonexistent path, assert it raises. Four lines each. Even without a formal test framework, a if __name__ == "__main__" smoke test at the bottom of the module would be better than nothing.
There was a problem hiding this comment.
I merged main, since the test framework was added, and added tests in eaf2a07
Adds a configuration option so users can run a custom Python script to modify the HTML (via BeautifulSoup) before PDF generation (e.g. remove, add, or change HTML elements).
Usage
docs preview: https://sphinx-simplepdf--128.org.readthedocs.build/en/128/configuration.html#simplepdf-html-hook
Changes
simplepdf_html_hookconfig option, a path to a Python scripthtml_hook(soup, app)that returns aBeautifulSoupobject._load_html_hook()and_execute_html_hook()methods to the builder_toctree_fix()to work with BeautifulSoup objects directly (used in the same pipeline as the hook)tests/test_html_hook.py: