Skip to content

New Feature: Add configurable HTML hook before PDF generation#128

Open
jdillard wants to merge 12 commits intouseblocks:mainfrom
jdillard:html-hook
Open

New Feature: Add configurable HTML hook before PDF generation#128
jdillard wants to merge 12 commits intouseblocks:mainfrom
jdillard:html-hook

Conversation

@jdillard
Copy link
Copy Markdown
Contributor

@jdillard jdillard commented Jan 28, 2026

Adds a configuration option so users can run a custom Python script to modify the HTML (via BeautifulSoup) before PDF generation (e.g. remove, add, or change HTML elements).

Usage

# conf.py
simplepdf_html_hook = "./hooks/pdf_hook.py"

# hooks/pdf_hook.py — must define a function named html_hook
from bs4 import BeautifulSoup

def html_hook(soup, app):
    """soup: BeautifulSoup of the index HTML; app: Sphinx application. Return modified soup."""
    # e.g. remove nav, add watermark
    return soup

docs preview: https://sphinx-simplepdf--128.org.readthedocs.build/en/128/configuration.html#simplepdf-html-hook

Changes

  • Added simplepdf_html_hook config option, a path to a Python script
    • The script must define a function named html_hook(soup, app) that returns a BeautifulSoup object.
  • Added _load_html_hook() and _execute_html_hook() methods to the builder
  • Refactored _toctree_fix() to work with BeautifulSoup objects directly (used in the same pipeline as the hook)
  • Added configuration documentation and updated changelog
  • Added tests/test_html_hook.py:
    • Valid hook: default → full SimplePDF build succeeds and produces a PDF.
    • Missing script: simplepdf_html_hook points at a non-existent file → ConfigError (not found).
    • Missing html_hook: script without a callable html_hook → ConfigError (html_hook).
    • Hook returns None: html_hook returns None → ExtensionError (returned None).

@jdillard jdillard marked this pull request as draft January 28, 2026 20:05
@jdillard jdillard changed the title Add hook to manipulate HTML Add configurable HTML hook before PDF generation Mar 18, 2026
@jdillard jdillard marked this pull request as ready for review March 18, 2026 01:13
@jdillard jdillard changed the title Add configurable HTML hook before PDF generation New Feature: Add configurable HTML hook before PDF generation Mar 18, 2026
Copy link
Copy Markdown
Collaborator

@patdhlk patdhlk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for your work 🙏 The design is sound. The _toctree_fix refactor to pass soup through a pipeline is a good cleanup. The hook contract is simple and useful.

Tighten up the error handling and this is ready.

To summarize:

  • Guard against spec_from_file_location returning None
  • Load the hook once in __init__, not on every call to _execute_html_hook
  • Make the None return an error, not a warning — forgetting return shouldn't silently no-op
  • Drop the version bump from this PR
  • Add a note that it's intentionally single-hook

Comment thread sphinx_simplepdf/builders/simplepdf.py Outdated
Comment on lines +205 to +208
spec = importlib.util.spec_from_file_location("simplepdf_hook", script_path)
module = importlib.util.module_from_spec(spec)
try:
spec.loader.exec_module(module)
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You're loading and executing an arbitrary Python file from a path in the config. This is fine — conf.py is already arbitrary Python, so you're not expanding the trust boundary. But there's a subtlety: you hardcode the module name to "simplepdf_hook". If someone has two different SimplePDF projects and Python somehow shares the process (think tox, or a monorepo build script), you'll shadow one with the other in sys.modules. You don't insert into sys.modules explicitly, so it's probably fine, but spec_from_file_location can be surprising here. Worth a comment, or use a unique name like f"simplepdf_hook_{id(self)}".

More importantly: spec_from_file_location can return None if the path is weird (e.g., a directory, a .pyc without source). You immediately do spec.loader.exec_module(module) — that's an AttributeError: 'NoneType' has no attribute 'loader' if spec is None. Check for it:

  spec = importlib.util.spec_from_file_location("simplepdf_hook", script_path)
  if spec is None or spec.loader is None:
      raise ConfigError(f"Cannot load module from: {script_path}")

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Address in 80c7ee5

Comment thread sphinx_simplepdf/builders/simplepdf.py Outdated
Comment on lines +247 to +252
if result is None:
logger.warning(
"simplepdf_html_hook returned None, using original HTML. "
"The hook should return a BeautifulSoup object."
)
return soup
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is too forgiving. The most common reason html_hook returns None is the user forgot the return statement. Silently using the original HTML means their hook did nothing and they'll spend 20 minutes wondering why. Make this an error, not a warning. The user will thank you.

If you really want to be lenient, at least make the warning loud enough that it won't get buried:

  logger.warning(
      "simplepdf_html_hook returned None — did you forget 'return soup'? "
      "Falling back to unmodified HTML.",
      type="simplepdf", subtype="hook",
  )

But I'd just raise.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressed in 6a8e2c8

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No tests. The project has no test harness, I know. But _load_html_hook is testable in isolation — give it a temp file with a function, assert it loads. Give it a file without html_hook, assert it raises. Give it a nonexistent path, assert it raises. Four lines each. Even without a formal test framework, a if __name__ == "__main__" smoke test at the bottom of the module would be better than nothing.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I merged main, since the test framework was added, and added tests in eaf2a07

@jdillard
Copy link
Copy Markdown
Contributor Author

jdillard commented Apr 2, 2026

Thanks for the review! I responded to comments and here are the commits for the remaining items:

  • 5110ecb: Load the hook once in init, not on every call to _execute_html_hook
  • dfbc73f: Drop the version bump from this PR
  • 6ba694d: Add a note that it's intentionally single-hook

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants