New Feature: Add configurable HTML hook before PDF generation by jdillard · Pull Request #128 · useblocks/sphinx-simplepdf

jdillard · 2026-01-28T20:02:38Z

Adds a configuration option so users can run a custom Python script to modify the HTML (via BeautifulSoup) before PDF generation (e.g. remove, add, or change HTML elements).

Usage

# conf.py
simplepdf_html_hook = "./hooks/pdf_hook.py"

# hooks/pdf_hook.py — must define a function named html_hook
from bs4 import BeautifulSoup

def html_hook(soup, app):
    """soup: BeautifulSoup of the index HTML; app: Sphinx application. Return modified soup."""
    # e.g. remove nav, add watermark
    return soup

docs preview: https://sphinx-simplepdf--128.org.readthedocs.build/en/128/configuration.html#simplepdf-html-hook
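
For illustration, here is a minimal sketch of a hook that does what the placeholder comment in the usage snippet suggests (remove nav, add a watermark). The "watermark" class name and the "DRAFT" text are assumptions made for this example, not part of the PR. The hook only calls methods on the soup object it receives, so it does not need to import bs4 itself.

```python
# hooks/pdf_hook.py (illustrative sketch, not the PR's code)


def html_hook(soup, app):
    """Remove <nav> elements and prepend a watermark banner to <body>."""
    # Drop every <nav> element from the document.
    for nav in soup.find_all("nav"):
        nav.decompose()

    # Insert a watermark <div> as the first child of <body>, if a body exists.
    if soup.body is not None:
        watermark = soup.new_tag("div")
        watermark["class"] = "watermark"  # hypothetical class name
        watermark.string = "DRAFT"  # hypothetical watermark text
        soup.body.insert(0, watermark)

    # Returning the soup matters: a hook that returns None is an error.
    return soup
```

The app argument is unused here; a real hook could read values from app.config to decide what to change.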

Changes

Added simplepdf_html_hook config option, a path to a Python script
- The script must define a function named html_hook(soup, app) that returns a BeautifulSoup object.
Added _load_html_hook() and _execute_html_hook() methods to the builder
Refactored _toctree_fix() to work with BeautifulSoup objects directly (used in the same pipeline as the hook)
Added configuration documentation and updated changelog
Added tests/test_html_hook.py:
- Valid hook: default → full SimplePDF build succeeds and produces a PDF.
- Missing script: simplepdf_html_hook points at a non-existent file → ConfigError (not found).
- Missing html_hook: script without a callable html_hook → ConfigError (html_hook).
- Hook returns None: html_hook returns None → ExtensionError (returned None).

patdhlk

Thanks for your work 🙏 The design is sound. The _toctree_fix refactor to pass soup through a pipeline is a good cleanup. The hook contract is simple and useful.

Tighten up the error handling and this is ready.

To summarize:

Guard against spec_from_file_location returning None
Load the hook once in __init__, not on every call to _execute_html_hook
Make the None return an error, not a warning — forgetting return shouldn't silently no-op
Drop the version bump from this PR
Add a note that it's intentionally single-hook

patdhlk · 2026-04-01T16:55:12Z

+        spec = importlib.util.spec_from_file_location("simplepdf_hook", script_path)
+        module = importlib.util.module_from_spec(spec)
+        try:
+            spec.loader.exec_module(module)


You're loading and executing an arbitrary Python file from a path in the config. This is fine — conf.py is already arbitrary Python, so you're not expanding the trust boundary. But there's a subtlety: you hardcode the module name to "simplepdf_hook". If someone has two different SimplePDF projects and Python somehow shares the process (think tox, or a monorepo build script), you'll shadow one with the other in sys.modules. You don't insert into sys.modules explicitly, so it's probably fine, but spec_from_file_location can be surprising here. Worth a comment, or use a unique name like f"simplepdf_hook_{id(self)}".

More importantly: spec_from_file_location can return None if the path is weird (e.g., a directory, a .pyc without source). You immediately do spec.loader.exec_module(module) — that's an AttributeError: 'NoneType' has no attribute 'loader' if spec is None. Check for it:

spec = importlib.util.spec_from_file_location("simplepdf_hook", script_path)
if spec is None or spec.loader is None:
    raise ConfigError(f"Cannot load module from: {script_path}")

Addressed in 80c7ee5
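
A minimal sketch of a loader that applies both suggestions from this comment (the None guard and a per-instance module name) might look like the following. It is not the code from the PR: load_html_hook and unique_suffix are hypothetical names, and ConfigError is a local stand-in for sphinx.errors.ConfigError so the sketch runs without Sphinx installed.

```python
import importlib.util
from pathlib import Path


class ConfigError(Exception):
    """Stand-in for sphinx.errors.ConfigError (assumption for this sketch)."""


def load_html_hook(script_path, unique_suffix):
    """Load the html_hook callable from a Python file, failing on bad input."""
    path = Path(script_path)
    if not path.is_file():
        raise ConfigError(f"simplepdf_html_hook script not found: {path}")

    # A per-builder module name avoids clashes if projects share a process.
    module_name = f"simplepdf_hook_{unique_suffix}"
    spec = importlib.util.spec_from_file_location(module_name, str(path))
    # spec is None when no loader matches the path (e.g. an unknown suffix).
    if spec is None or spec.loader is None:
        raise ConfigError(f"Cannot load module from: {path}")

    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)

    hook = getattr(module, "html_hook", None)
    if not callable(hook):
        raise ConfigError(f"{path} must define a callable named html_hook")
    return hook
```

In a builder, this would be called once (for example with id(self) as the suffix) and the returned callable stored for later use.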

patdhlk · 2026-04-01T16:58:39Z

+        if result is None:
+            logger.warning(
+                "simplepdf_html_hook returned None, using original HTML. "
+                "The hook should return a BeautifulSoup object."
+            )
+            return soup


This is too forgiving. The most common reason html_hook returns None is the user forgot the return statement. Silently using the original HTML means their hook did nothing and they'll spend 20 minutes wondering why. Make this an error, not a warning. The user will thank you.

If you really want to be lenient, at least make the warning loud enough that it won't get buried:

logger.warning(
    "simplepdf_html_hook returned None — did you forget 'return soup'? "
    "Falling back to unmodified HTML.",
    type="simplepdf",
    subtype="hook",
)

But I'd just raise.

Addressed in 6a8e2c8
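
The stricter behaviour requested here could be sketched as follows. This is an illustration under assumptions, not the committed code: execute_html_hook is a hypothetical free function, and ExtensionError is a local stand-in for sphinx.errors.ExtensionError so the example runs without Sphinx.

```python
class ExtensionError(Exception):
    """Stand-in for sphinx.errors.ExtensionError (assumption for this sketch)."""


def execute_html_hook(hook, soup, app):
    """Run the user hook and refuse to continue if it returned None."""
    result = hook(soup, app)
    if result is None:
        # The usual cause is a missing `return soup` at the end of the hook.
        raise ExtensionError(
            "simplepdf_html_hook returned None; did you forget 'return soup'?"
        )
    return result
```

Raising instead of warning means a hook that forgets its return statement stops the build immediately rather than producing an unmodified PDF.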

patdhlk · 2026-04-01T17:08:55Z

No tests. The project has no test harness, I know. But _load_html_hook is testable in isolation: give it a temp file with a function, assert it loads. Give it a file without html_hook, assert it raises. Give it a nonexistent path, assert it raises. Four lines each. Even without a formal test framework, an if __name__ == "__main__" smoke test at the bottom of the module would be better than nothing.

I merged main, since the test framework was added, and added tests in eaf2a07

jdillard · 2026-04-02T21:27:21Z

Thanks for the review! I responded to comments and here are the commits for the remaining items:

5110ecb: Load the hook once in __init__, not on every call to _execute_html_hook
dfbc73f: Drop the version bump from this PR
6ba694d: Add a note that it's intentionally single-hook

Add hook to manipulate HTML

30e14d8

jdillard marked this pull request as draft January 28, 2026 20:05

use a set function name

0ff5424

jdillard changed the title ~~Add hook to manipulate HTML~~ Add configurable HTML hook before PDF generation Mar 18, 2026

jdillard marked this pull request as ready for review March 18, 2026 01:13

Merge branch 'main' into html-hook

a297fc7

jdillard changed the title ~~Add configurable HTML hook before PDF generation~~ New Feature: Add configurable HTML hook before PDF generation Mar 18, 2026

patdhlk requested changes Apr 1, 2026

jdillard added 9 commits April 1, 2026 17:55

Merge branch 'main' into html-hook

234b9ee

Drop the version bump from this PR

dfbc73f

run ruff formatting

efb5519

Make the None return an error, not a warning

6a8e2c8

Load the hook once in __init__, not on every call to _execute_html_hook

5110ecb

Guard against spec_from_file_location returning None and use virtual module name

80c7ee5

fix mypy and ruff now that i have pre-commit installed

b229cc0

Add tests

eaf2a07

Add a note that it's intentionally single-hook

6ba694d

jdillard wants to merge 12 commits into useblocks:main from jdillard:html-hook
