Updates docs for 1.27.2.2 release. by jamie-lemon · Pull Request #4947 · pymupdf/PyMuPDF

jamie-lemon · 2026-03-20T15:22:02Z

Adds more comprehensive information about OCR and PyMuPDF4LLM.

Adds more comprehensive information about OCR.

docs/ocr/tesseract-language-packs.rst

docs/pymupdf4llm/index.rst

JorjMcKie

There is one minor comment about a misleading formulation ...

julian-smith-artifex-com · 2026-03-20T23:16:42Z

docs/pymupdf4llm/index.rst

+   md_text = pymupdf4llm.to_markdown("multilingual.pdf",
+                                      ocr_language="eng+deu")
+
+Tesseract language packs must be installed separately on your system. For example, on Ubuntu:


Possibly remove the word "separately" from this sentence?

julian-smith-artifex-com · 2026-03-20T23:18:20Z

docs/pymupdf4llm/ocr-plugins.rst

-The default plugins are designed to be used as is, without any need for configuration. However, if you want to use a specific plugin, you can do so by using the following approach (which enforces for instance using RapidOCR and skipping above selection process). Please note that all plugins have a function named `exec_ocr` that does the actual OCR::
+The default plugins are designed to be used as is, without any need for configuration. 
+
+However, if you want to use a specific plugin, you can do so by using the following approach (which enforces for instance using RapidOCR and skipping above selection process). Please note that all plugins have a function named `exec_ocr` that does the actual OCR.


Very minor, but suggest changing

skipping above selection process

to

skipping the above selection process

julian-smith-artifex-com · 2026-03-20T23:20:41Z

docs/pymupdf4llm/ocr-plugins.rst

+
+If you want to use both OCR engines side-by-side, you can do so by implementing a custom OCR function which calls both OCR engines — one for bbox recognition (RapidOCR) and the other for text recognition (Tesseract) — and then combines their results.
+
+This pre-made callable OCR function can be found in the ``pymupdf4llm.ocr`` module as ``rapidtess_api.exec_ocr``.


Change rapidtess_api.exec_ocr to rapidtess_api.exec_ocr()? Otherwise it doesn't look like a function.
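For readers following the thread, the "one engine for bboxes, another for text" idea in the quoted paragraph can be sketched roughly as below. This is only an illustration under stated assumptions: `combine_engines`, `detect_boxes` and `recognize_text` are hypothetical names, and nothing here reflects the actual signature or behaviour of `rapidtess_api.exec_ocr`.

```python
from typing import Callable, Sequence

# (x0, y0, x1, y1) rectangle; the coordinate convention is an assumption.
BBox = tuple[float, float, float, float]


def combine_engines(
    image: object,
    detect_boxes: Callable[[object], Sequence[BBox]],
    recognize_text: Callable[[object, BBox], str],
) -> list[tuple[BBox, str]]:
    """Use one engine to find text regions and another to read each region.

    Both callables are hypothetical stand-ins for real OCR engines.
    """
    results = []
    for bbox in detect_boxes(image):
        text = recognize_text(image, bbox).strip()
        if text:  # skip regions where the recognizer returned nothing
            results.append((bbox, text))
    return results
```

With stub callables in place of real engines, the function simply pairs each detected box with the text recognized inside it and drops empty regions.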

changes.txt

julian-smith-artifex-com · 2026-03-20T23:32:11Z

docs/installation.rst

 ---------------------------------------------------------

-If you do not intend to use this feature, skip this step. Otherwise, it is required for both installation paths: **from wheels and from sources.**
+PyMuPDF's OCR features rely on the Tesseract OCR engine which is included by default in your installation. It includes the English language pack by default. To install additional Tesseract language packs to enable OCR for languages other than English, see :ref:`Tesseract Language Packs <tesseract-language-packs>` for instructions on how to do this on different platforms.


I'm confused - this seems to suggest that the pymupdf4llm wheel contains the Tesseract English language pack, which I don't think is the case?

In practice pymupdf uses various ways to locate tesseract language packs, including running tesseract command-line programmes, and I don't understand why this PR removes information about this.
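To make the "running tesseract command-line programmes" point concrete, here is a minimal sketch of how installed language packs could be discovered with the `tesseract --list-langs` command and compared against a string such as `"eng+deu"`. It is not PyMuPDF's actual lookup logic; the helper names `parse_list_langs`, `installed_languages` and `missing_languages` are made up for this illustration.

```python
import shutil
import subprocess


def parse_list_langs(output: str) -> list[str]:
    """Extract language codes from the text printed by `tesseract --list-langs`."""
    langs = []
    for line in output.splitlines():
        line = line.strip()
        # Skip blank lines and the "List of available languages ..." header.
        if not line or line.lower().startswith("list of available languages"):
            continue
        langs.append(line)
    return langs


def installed_languages() -> list[str]:
    """Return the codes reported by the tesseract CLI, or [] if it cannot be run."""
    exe = shutil.which("tesseract")
    if exe is None:
        return []
    try:
        proc = subprocess.run(
            [exe, "--list-langs"], capture_output=True, text=True, timeout=10
        )
    except (OSError, subprocess.TimeoutExpired):
        return []
    # Some older Tesseract releases print the list to stderr instead of stdout.
    return parse_list_langs(proc.stdout) or parse_list_langs(proc.stderr)


def missing_languages(requested: str, available: list[str]) -> list[str]:
    """Return the parts of a string like "eng+deu" that are not in `available`."""
    return [code for code in requested.split("+") if code and code not in available]
```

A caller could run `missing_languages("eng+deu", installed_languages())` before starting OCR and point the user at the language-pack instructions if the result is non-empty.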

Updates docs for 1.27.2.2 release.

24e4e1c

Adds more comprehensive information about OCR.

jamie-lemon requested review from JorjMcKie and julian-smith-artifex-com March 20, 2026 15:22

JorjMcKie reviewed Mar 20, 2026

docs/ocr/tesseract-language-packs.rst (outdated, resolved)

jamie-lemon and others added 2 commits March 20, 2026 15:40

Merge branch 'main' into docs-general-updates

0e5b0d0

Links OCR section to OCR Plugins page and some tidy up.

4182a46

JorjMcKie reviewed Mar 20, 2026

docs/pymupdf4llm/index.rst (outdated, resolved)

JorjMcKie reviewed Mar 20, 2026

Some small corrections to the OCR page details.

1059bc4

JorjMcKie approved these changes Mar 20, 2026

julian-smith-artifex-com reviewed Mar 20, 2026

jamie-lemon wants to merge 4 commits into main from docs-general-updates

3 participants

