Updates docs for 1.27.2.2 release. #4947

Open
jamie-lemon wants to merge 4 commits into main from docs-general-updates

@jamie-lemon
Collaborator

Adds more comprehensive information about OCR and PyMuPDF4LLM.

Adds more comprehensive information about OCR.
Collaborator

@JorjMcKie left a comment

There is one minor comment about a misleading formulation ...

md_text = pymupdf4llm.to_markdown("multilingual.pdf",
                                  ocr_language="eng+deu")

Tesseract language packs must be installed separately on your system. For example, on Ubuntu:
Collaborator

Possibly remove the word "separately" from this sentence?

The default plugins are designed to be used as is, without any need for configuration. However, if you want to use a specific plugin, you can do so by using the following approach (which enforces for instance using RapidOCR and skipping above selection process). Please note that all plugins have a function named `exec_ocr` that does the actual OCR::
The default plugins are designed to be used as is, without any need for configuration.

However, if you want to use a specific plugin, you can do so by using the following approach (which enforces for instance using RapidOCR and skipping above selection process). Please note that all plugins have a function named `exec_ocr` that does the actual OCR.
Collaborator

Very minor, but suggest changing

skipping above selection process

to

skipping the above selection process


If you want to use both OCR engines side-by-side, you can do so by implementing a custom OCR function which calls both OCR engines — one for bbox recognition (RapidOCR) and the other for text recognition (Tesseract) — and then combines their results.

This pre-made callable OCR function can be found in the ``pymupdf4llm.ocr`` module as ``rapidtess_api.exec_ocr``.
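The two-engine idea described above (one engine finds the boxes, the other reads the text inside them) could be sketched roughly as follows. This is only an illustration of the combination step, not the actual `rapidtess_api` implementation: `combine_ocr`, `detect_boxes`, and `recognize_text` are hypothetical names, and the real plugin's signature and return type may differ.

```python
from typing import Callable, List, Tuple

# Hypothetical box type for this sketch: (x0, y0, x1, y1)
BBox = Tuple[float, float, float, float]


def combine_ocr(
    image: object,
    detect_boxes: Callable[[object], List[BBox]],
    recognize_text: Callable[[object, BBox], str],
) -> List[Tuple[BBox, str]]:
    """Use one engine to find boxes and a second engine to read each box."""
    results = []
    for bbox in detect_boxes(image):
        # The recognizer only sees the region proposed by the detector.
        text = recognize_text(image, bbox).strip()
        if text:  # drop boxes where the recognizer found nothing
            results.append((bbox, text))
    return results
```

In this sketch the detector would wrap something like RapidOCR and the recognizer something like Tesseract; both are passed in as callables so the combination logic stays independent of either engine.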
Collaborator

Change ``rapidtess_api.exec_ocr`` to ``rapidtess_api.exec_ocr()``? Otherwise it doesn't look like a function.

---------------------------------------------------------

If you do not intend to use this feature, skip this step. Otherwise, it is required for both installation paths: **from wheels and from sources.**
PyMuPDF's OCR features rely on the Tesseract OCR engine which is included by default in your installation. It includes the English language pack by default. To install additional Tesseract language packs to enable OCR for languages other than English, see :ref:`Tesseract Language Packs <tesseract-language-packs>` for instructions on how to do this on different platforms.
Collaborator

@julian-smith-artifex-com Mar 20, 2026

I'm confused - this seems to suggest that the pymupdf4llm wheel contains the Tesseract English language pack, which I don't think is the case?

In practice, pymupdf uses various ways to locate Tesseract language packs, including running Tesseract command-line programmes, and I don't understand why this PR removes information about this.
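To make the point about locating language packs concrete, here is a minimal sketch of one such check: asking the `tesseract` command-line program which languages it can see, then comparing that against a `+`-joined spec such as `"eng+deu"`. This is not how PyMuPDF itself does the lookup; the helper names are hypothetical, and the only external facts assumed are the `tesseract --list-langs` option and its header line ending in a colon.

```python
import shutil
import subprocess


def parse_list_langs(output: str) -> set[str]:
    """Extract language codes from `tesseract --list-langs` output."""
    langs = set()
    for line in output.splitlines():
        line = line.strip()
        # Skip blank lines and the header line, which ends with a colon.
        if not line or line.endswith(":"):
            continue
        langs.add(line)
    return langs


def installed_languages() -> set[str]:
    """Return the codes tesseract reports, or an empty set if it is not on PATH."""
    exe = shutil.which("tesseract")
    if exe is None:
        return set()
    result = subprocess.run(
        [exe, "--list-langs"], capture_output=True, text=True, check=False
    )
    # Assumption: fall back to stderr, since some older builds print the list there.
    return parse_list_langs(result.stdout or result.stderr)


def missing_languages(spec: str, available: set[str]) -> list[str]:
    """Return codes from a '+'-joined spec that are not in `available`."""
    return [code for code in spec.split("+") if code and code not in available]
```

Calling `missing_languages("eng+deu", installed_languages())` before running OCR would show which packs still need to be installed on the system.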

3 participants