diff --git a/changes.txt b/changes.txt index 672eab58e..804544e79 100644 --- a/changes.txt +++ b/changes.txt @@ -7,7 +7,7 @@ Change Log * Fixed issues: * **Fixed** `4902 `_: Incorrect linewidth in elements returned by Page.get_texttrace() - * **Fixed** `4932 `_: `"Page" has no attribute "find_tables" in PyMuPDF 1.27 + * **Fixed** `4932 `_: "Page" has no attribute "find_tables" in PyMuPDF 1.27 * Other: @@ -20,12 +20,12 @@ Change Log * Fixed issues: - * **Fixed** `4903 `_: Typing broken because of *_forward_decl + * **Fixed** `4903 `_: Typing broken because of `*_forward_decl` * Other: * Retrospectively marked #4907 as fixed in pymupdf-1.27.1. - * Improved get_textpage_ocr(). + * Improved `get_textpage_ocr()`. For partial OCR, **all** page areas outside legible text are now OCRed, not just those within images. This means that OCR will now also be performed diff --git a/docs/installation.rst b/docs/installation.rst index 03b8e5031..75c412ee8 100644 --- a/docs/installation.rst +++ b/docs/installation.rst @@ -303,36 +303,8 @@ See :doc:`pyodide`. Enabling Integrated OCR Support --------------------------------------------------------- -If you do not intend to use this feature, skip this step. Otherwise, it is required for both installation paths: **from wheels and from sources.** +PyMuPDF's OCR features rely on the Tesseract OCR engine which is included by default in your installation. It includes the English language pack by default. To install additional Tesseract language packs to enable OCR for languages other than English, see :ref:`Tesseract Language Packs ` for instructions on how to do this on different platforms. -PyMuPDF will already contain all the logic to support OCR functions. But it additionally does need `Tesseract’s language support data `_. -If not specified explicitly, PyMuPDF will attempt to find the installed -Tesseract's tessdata, but this should probably not be relied upon. 
- -Otherwise PyMuPDF requires that Tesseract's language support folder is -specified explicitly either in PyMuPDF OCR functions' `tessdata` arguments or -`os.environ["TESSDATA_PREFIX"]`. - -So for a working OCR functionality, make sure to complete this checklist: - -1. Locate Tesseract's language support folder. Typically you will find it here: - - * Windows: `C:/Program Files/Tesseract-OCR/tessdata` - * Unix systems: `/usr/share/tesseract-ocr/4.00/tessdata` - -2. Specify the language support folder when calling PyMuPDF OCR functions: - - * Set the `tessdata` argument. - * Or set `os.environ["TESSDATA_PREFIX"]` from within Python. - * Or set environment variable `TESSDATA_PREFIX` before running Python, for example: - - * Windows: `setx TESSDATA_PREFIX "C:/Program Files/Tesseract-OCR/tessdata"` - * Unix systems: `declare -x TESSDATA_PREFIX=/usr/share/tesseract-ocr/4.00/tessdata` - - -.. note:: - - Find out more on the `official documentation for installing Tesseract website `_. .. include:: footer.rst diff --git a/docs/ocr/tesseract-language-packs.rst b/docs/ocr/tesseract-language-packs.rst new file mode 100644 index 000000000..d57839bb8 --- /dev/null +++ b/docs/ocr/tesseract-language-packs.rst @@ -0,0 +1,251 @@ + +.. include:: ../header.rst + +.. _pymupdf-pro: + +.. raw:: html + + + + +.. _tesseract-language-packs: + +Tesseract Language Packs +======================== + +.. meta:: + :description: How to install additional Tesseract language packs on macOS, Linux, and Windows. + +Overview +-------- + +Tesseract identifies languages using three-letter `ISO 639-2 `_ codes. English (``eng``) is installed by default on most platforms. For any other language, you need to install the corresponding language pack before pymupdf4llm can use it for OCR. + +A full list of supported language codes is available on the `Tesseract tessdata repository `_. + +.. tip:: + + To see which languages are already installed on your system, run ``tesseract --list-langs`` in your terminal. 
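The check described in the tip above can also be scripted. The following is a minimal sketch, not part of PyMuPDF or pymupdf4llm: it assumes the ``tesseract`` executable is on ``PATH`` and that ``--list-langs`` prints a header line ending in a colon followed by one language code per line. The helper names ``parse_list_langs`` and ``installed_languages`` are hypothetical.

```python
import shutil
import subprocess


def parse_list_langs(output: str) -> list[str]:
    """Extract language codes from `tesseract --list-langs` output (assumed format)."""
    codes = []
    for line in output.splitlines():
        line = line.strip()
        # Skip blank lines and the header line, which ends with a colon
        if not line or line.endswith(":"):
            continue
        codes.append(line)
    return codes


def installed_languages() -> list[str]:
    """Return installed language codes, or [] if tesseract is not on PATH."""
    exe = shutil.which("tesseract")
    if exe is None:
        return []
    result = subprocess.run(
        [exe, "--list-langs"], capture_output=True, text=True, check=False
    )
    # Prefer stdout; fall back to stderr, which some older builds use for the list
    return parse_list_langs(result.stdout or result.stderr)


if __name__ == "__main__":
    print(installed_languages())
```

A script could call ``installed_languages()`` before running OCR and warn when the required code, for example ``"fra"``, is missing from the returned list.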
+ +---- + +Linux +----- + +Language pack installation varies slightly by distribution. + +**Ubuntu / Debian** + +.. code-block:: bash + + # List all available language packs + apt-cache search tesseract-ocr + + # Install a specific language (e.g. German) + sudo apt install tesseract-ocr-deu + + # Install all available languages at once + sudo apt install tesseract-ocr-all + +Language packages follow the naming pattern ``tesseract-ocr-``, for example ``tesseract-ocr-fra`` for French or ``tesseract-ocr-chi-sim`` for Simplified Chinese. + +**Fedora / RHEL** + +.. code-block:: bash + + # Search for available language packs + dnf search tesseract + + # Install a specific language (e.g. German) + sudo dnf install tesseract-langpack-deu + + # Install all language packs (quoted so the shell passes the glob to dnf unchanged) + sudo dnf install 'tesseract-langpack-*' + +On Fedora, packages are named ``tesseract-langpack-``. + +**Arch Linux** + +.. code-block:: bash + + # Search for available language packs + pacman -Ss tesseract-data + + # Install a specific language (e.g. German) + sudo pacman -S tesseract-data-deu + +On Arch, packages are named ``tesseract-data-``. + +Manual Installation (All Distros) +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +If a language pack is not available through your package manager, download the ``.traineddata`` file directly from GitHub and copy it to your Tesseract data directory: + +.. code-block:: bash + + # Download language pack (e.g. French) + curl -L https://github.com/tesseract-ocr/tessdata/raw/main/fra.traineddata \ + -o fra.traineddata + + # Copy to tessdata directory (path varies by distro) + sudo cp fra.traineddata /usr/share/tesseract-ocr/4.00/tessdata/ + # or + sudo cp fra.traineddata /usr/share/tessdata/ + +Common tessdata locations on Linux: + +.. 
list-table:: + :header-rows: 1 + :widths: 40 60 + + * - Distribution + - Path + * - Ubuntu / Debian + - ``/usr/share/tesseract-ocr/4.00/tessdata/`` + * - Fedora / RHEL + - ``/usr/share/tesseract/tessdata/`` + * - Arch Linux + - ``/usr/share/tessdata/`` + +---- + +Windows +------- + +During Installation (Recommended) +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +The Tesseract Windows installer from `UB Mannheim `_ lets you select additional language packs during setup. When you reach the **Choose Components** screen, expand **Additional language data** and tick the languages you need. + +After Installation (Manual) +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +If Tesseract is already installed, download language packs manually: + +1. Go to `github.com/tesseract-ocr/tessdata `_ +2. Download the ``.traineddata`` file for your language (e.g. ``fra.traineddata`` for French) +3. Copy the file into your Tesseract ``tessdata`` folder, typically: + +.. code-block:: text + + C:\Program Files\Tesseract-OCR\tessdata\ + +.. note:: + + The Chocolatey (``choco install tesseract``) package only includes English. All additional languages must be added manually using the steps above. + +Verify the Install +~~~~~~~~~~~~~~~~~~ + +Open Command Prompt or PowerShell and run: + +.. code-block:: powershell + + tesseract --list-langs + +Your newly installed language should appear in the output. + +---- + +macOS +----- + +The recommended approach on macOS is `Homebrew `_. There are two options depending on how much disk space you want to use. + +Install All Languages at Once +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +The ``tesseract-lang`` formula bundles Tesseract with every available language pack: + +.. code-block:: bash + + brew install tesseract-lang + +Install Specific Languages +~~~~~~~~~~~~~~~~~~~~~~~~~~ + +If you only need a few languages, install ``tesseract`` first and then manually download the ``.traineddata`` files you need: + +.. 
code-block:: bash + + # Install Tesseract engine only + brew install tesseract + + # Find the tessdata directory + brew info tesseract + # Look for a line like: /opt/homebrew/share/tessdata + + # Download a specific language pack (e.g. French) + curl -L https://github.com/tesseract-ocr/tessdata/raw/main/fra.traineddata \ + -o /opt/homebrew/share/tessdata/fra.traineddata + +Replace ``fra`` with your target language code and adjust the tessdata path to match what ``brew info tesseract`` reports on your machine. + +.. note:: + + If you installed Tesseract via MacPorts instead of Homebrew, use ``port install tesseract-``, for example ``sudo port install tesseract-fra``. + +---- + +Using a Language with pymupdf4llm +---------------------------------- + +Once a language pack is installed, pass its code to ``to_markdown()`` via the ``ocr_language`` parameter: + +.. code-block:: python + + import pymupdf4llm + + # Single language + md = pymupdf4llm.to_markdown("document.pdf", ocr_language="fra") + + # Multiple languages + md = pymupdf4llm.to_markdown("document.pdf", ocr_language="eng+fra+deu") + +---- + +Common Language Codes +--------------------- + +.. list-table:: + :header-rows: 1 + :widths: 50 50 + + * - Language + - Code + * - English + - ``eng`` + * - French + - ``fra`` + * - German + - ``deu`` + * - Spanish + - ``spa`` + * - Italian + - ``ita`` + * - Portuguese + - ``por`` + * - Simplified Chinese + - ``chi_sim`` + * - Traditional Chinese + - ``chi_tra`` + * - Japanese + - ``jpn`` + * - Korean + - ``kor`` + * - Arabic + - ``ara`` + * - Russian + - ``rus`` + * - Hindi + - ``hin`` + +For the full list of supported languages and their codes, see the `Tesseract tessdata repository `_. + + + +.. 
include:: ../footer.rst + + diff --git a/docs/pymupdf4llm/index.rst b/docs/pymupdf4llm/index.rst index 83af4697a..9079471b9 100644 --- a/docs/pymupdf4llm/index.rst +++ b/docs/pymupdf4llm/index.rst @@ -17,7 +17,7 @@ PyMuPDF4LLM |PyMuPDF4LLM| is a lightweight extension for |PyMuPDF| that turns PDFs into clean, structured data with minimal setup. It includes layout analysis *without* any GPU requirement. -|PyMuPDF4LLM| is aimed to make it easier to extract document content in the format you need for **LLM** & **RAG** environments. It supports :ref:`Markdown `, :ref:`JSON ` and :ref:`TXT ` extraction, as well as :ref:`LlamaIndex ` and :ref:`LangChain ` integration. +|PyMuPDF4LLM| makes it easy to extract document content in the format you need for **LLM** & **RAG** environments. It supports structured data extraction to :ref:`Markdown `, :ref:`JSON ` and :ref:`TXT `, as well as :ref:`LlamaIndex ` and :ref:`LangChain ` integration. .. important:: @@ -180,6 +180,148 @@ PyMuPDF4LLM & PyMuPDF Layout By default |PyMuPDF4LLM| includes a `layout analysis module`_ to enhance output results. To disable this module you can do so by calling the :meth:`use_layout` method. + +OCR +-------- + +PyMuPDF4LLM includes built-in OCR support for scanned documents and image-based PDFs. By default, OCR runs **automatically** when needed — you don't have to opt in. For more control, you can force OCR on specific pages, disable it entirely, or swap in a different OCR engine using the adaptor interface. + +.. note:: + + If you want to use an OCR engine other than Tesseract, see :ref:`OCR Engines ` for details. + + +Hybrid OCR strategy +~~~~~~~~~~~~~~~~~~~~~~~~~ + +PyMuPDF4LLM applies OCR only when it is genuinely required to obtain the complete text of a PDF page. If a page already contains sufficient extractable text, OCR is skipped entirely — avoiding unnecessary work and eliminating the risk of degrading high-quality digital text. 
+ +When OCR is needed, PyMuPDF4LLM automatically selects the most suitable OCR plugin available in the runtime environment, balancing detection accuracy with processing speed. + +Its built-in OCR plugins implement a Hybrid OCR strategy: only those regions lacking extractable, legible text are passed to the OCR engine. This selective approach typically reduces OCR processing time by around 50% while improving recognition accuracy, since the engine focuses exclusively on the problematic regions. The recognized text is then merged back into the original page, enriching it without disturbing existing digital content. + + +---- + + +Auto-OCR Behaviour +~~~~~~~~~~~~~~~~~~~~~~~~~~ + +PyMuPDF4LLM inspects each page before extracting text. If a page contains **no selectable text** — meaning all content is rasterised into images — OCR is triggered automatically for that page. + +Pages that contain native text only are never sent through OCR. This keeps processing fast and avoids degrading already-clean text. + +.. code-block:: python + + import pymupdf4llm + + # OCR runs automatically on any page with no selectable text + md_text = pymupdf4llm.to_markdown("scanned-document.pdf") + +The resulting Markdown is seamless — pages extracted via OCR and pages extracted natively are combined into a single output with no distinction between them. + +---- + +How OCR is Triggered +~~~~~~~~~~~~~~~~~~~~~~~~~~ + +There are two scenarios where OCR is applied automatically: + +**No text at all** — if a page contains roughly no text but is covered with images or many character-sized vectors, PyMuPDF4LLM checks whether text is *probably* detectable on the page. This distinguishes image-based text (e.g. a scanned document) from ordinary pictures like photographs. + +**Garbled text** — if a page does contain text but too many characters are unreadable (e.g. ``"�����"``), OCR is applied **for the affected text areas only**, not the full page. 
This preserves already-readable text, images, and vectors while recovering only what is broken. + + +---- + +Forcing OCR +~~~~~~~~~~~~~ + +In some cases you may want to force OCR even on pages that contain selectable text — for example, when the native text layer is corrupt, misencoded, or misaligned with the visual content. + +Use ``force_ocr=True`` to bypass the auto-detection check entirely: + +.. code-block:: python + + md_text = pymupdf4llm.to_markdown("document.pdf", force_ocr=True) + +.. warning:: + + Forcing OCR on clean, text-based PDFs will slow down processing significantly and may reduce output quality. Only use ``force_ocr=True`` when you have reason to distrust the native text layer. + +You can also force OCR on specific pages rather than the whole document: + +.. code-block:: python + + md_text = pymupdf4llm.to_markdown( + "document.pdf", + pages=[2, 3, 4], + force_ocr=True + ) + +---- + +Disabling OCR +~~~~~~~~~~~~~ + +To prevent OCR from running at all — even on pages with no selectable text — set ``use_ocr=False``: + +.. code-block:: python + + md_text = pymupdf4llm.to_markdown("document.pdf", use_ocr=False) + +Pages with no selectable text will return empty strings in this mode. This is useful when you know your documents are always text-based, or when you want to handle OCR yourself in a downstream step. + +---- + +.. _ocr-adaptors: +.. _ocr-engines: +.. _ocr-plugins: + +OCR Engines +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +Other OCR Engines (OCR Adaptors or Plugins) can be used with PyMuPDF4LLM. + +See :doc:`ocr-plugins` for details on how to use different OCR engines with PyMuPDF4LLM, including Tesseract, RapidOCR, and how to implement your own custom OCR function. + + + +OCR Language Support +~~~~~~~~~~~~~~~~~~~~~~~~~~ + +When using the default Tesseract adaptor, you can specify one or more languages using Tesseract's language codes. + +Specify the language to be used by the Tesseract OCR engine. Default is ``"eng"`` (English). 
Make sure that the respective language data files are installed. Remember to use correct Tesseract language codes. Multiple languages can be specified by concatenating the respective codes with a plus sign ``"+"``, for example ``"eng+deu"`` for English and German. + +.. code-block:: python + + md_text = pymupdf4llm.to_markdown("multilingual.pdf", + ocr_language="eng+deu") + +Tesseract language packs must be installed separately on your system. For example, on Ubuntu: + +.. code-block:: bash + + sudo apt install tesseract-ocr-deu tesseract-ocr-fra + +See the page on :ref:`installing Tesseract language packs ` for further details. + +---- + +Performance Tips +~~~~~~~~~~~~~~~~~~~~~~~~~~ + +OCR is the most compute-intensive part of the extraction pipeline. A few ways to keep it fast: + +- **Process only the pages you need** using the ``pages`` parameter to avoid running OCR on the entire document. +- **Cache results** — write the output to disk after the first run so you don't re-process the same file. +- **Use** ``force_ocr=False`` (the default) so clean pages skip OCR entirely. +- **Resize images before passing to OCR** — very high DPI scans can slow Tesseract down without improving accuracy. + +---- + + Further Resources ------------------- diff --git a/docs/pymupdf4llm/ocr-plugins.rst b/docs/pymupdf4llm/ocr-plugins.rst index 87b77060a..8aebe8e1d 100644 --- a/docs/pymupdf4llm/ocr-plugins.rst +++ b/docs/pymupdf4llm/ocr-plugins.rst @@ -1,7 +1,7 @@ .. include:: ../header.rst -Default OCR Functions +OCR Plugins ====================== PyMuPDF4LLM supports default OCR functions. They come in the form of plugins that are present in its `ocr` subpackage. They are based on currently 3 popular OCR engines, Tesseract OCR, RapidOCR and PaddleOCR. Some engines can be combined to make use of their strengths and mitigate their weaknesses. 
For example, Tesseract OCR is very good at **recognizing** text, while RapidOCR is better at **detecting** text bounding boxes in images with complex backgrounds. By combining the two engines, we can achieve better overall OCR results while at the same time also reducing the overall OCR processing time. @@ -45,7 +45,17 @@ It also increases the chances for a successful layout detection, because other o Forcing the Choice of a Default Plugin --------------------------------------- -The default plugins are designed to be used as is, without any need for configuration. However, if you want to use a specific plugin, you can do so by using the following approach (which enforces for instance using RapidOCR and skipping above selection process). Please note that all plugins have a function named `exec_ocr` that does the actual OCR:: +The default plugins are designed to be used as is, without any need for configuration. + +However, if you want to use a specific plugin, you can do so with the following approach, which in this example forces the use of RapidOCR and skips the selection process described above. Please note that all plugins have a function named ``exec_ocr`` that does the actual OCR. + + +RapidOCR +~~~~~~~~~ + +If `RapidOCR `_ and the RapidOCR ONNX Runtime are available, you can use a pre-made callable OCR function for it, which is provided in the ``pymupdf4llm.ocr`` module as ``rapidocr_api.exec_ocr``. + +.. 
code-block:: python import pymupdf4llm from pymupdf4llm.ocr import rapidocr_api @@ -56,6 +66,41 @@ The default plugins are designed to be used as is, without any need for configur md_text = pymupdf4llm.to_markdown("input.pdf", ocr_function=my_ocr_function) +RapidOCR & Tesseract Side-by-Side +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +If you want to use both OCR engines side-by-side, you can use an OCR function that calls both engines, RapidOCR for bounding box detection and Tesseract for text recognition, and then combines their results. + +A pre-made callable OCR function of this kind can be found in the ``pymupdf4llm.ocr`` module as ``rapidtess_api.exec_ocr``. + +**Example** + +.. code-block:: python + + import pymupdf4llm + from pymupdf4llm.ocr import rapidtess_api + + md = pymupdf4llm.to_markdown( + doc, + ocr_function=rapidtess_api.exec_ocr, + force_ocr=True + ) + +.. list-table:: + :header-rows: 1 + :widths: 35 25 40 + + * - Adaptor + - Engines + - Notes + * - ``rapidocr_api.exec_ocr`` + - RapidOCR + - Requires RapidOCR and ONNX Runtime + * - ``rapidtess_api.exec_ocr`` + - RapidOCR & Tesseract + - Better accuracy for bounding box detection and text recognition + + Providing your Own Plugin -------------------------