pymupdf · jamie-lemon · Mar 20, 2026
diff --git a/changes.txt b/changes.txt
@@ -7,7 +7,7 @@ Change Log
 * Fixed issues:
 
   * **Fixed** `4902 <https://github.com/pymupdf/PyMuPDF/issues/4902>`_: Incorrect linewidth in elements returned by Page.get_texttrace()
-  * **Fixed** `4932 <https://github.com/pymupdf/PyMuPDF/issues/4932>`_: `"Page" has no attribute "find_tables" in PyMuPDF 1.27
+  * **Fixed** `4932 <https://github.com/pymupdf/PyMuPDF/issues/4932>`_: "Page" has no attribute "find_tables" in PyMuPDF 1.27
 
 * Other:
 
@@ -20,12 +20,12 @@ Change Log
 
 * Fixed issues:
 
-  * **Fixed** `4903 <https://github.com/pymupdf/PyMuPDF/issues/4903>`_: Typing broken because of *_forward_decl
+  * **Fixed** `4903 <https://github.com/pymupdf/PyMuPDF/issues/4903>`_: Typing broken because of `*_forward_decl`
 
 * Other:
 
   * Retrospectively marked #4907 as fixed in pymupdf-1.27.1.
-  * Improved get_textpage_ocr().
+  * Improved `get_textpage_ocr()`.
 
     For partial OCR, **all** page areas outside legible text are now OCRed, not
     just those within images. This means that OCR will now also be performed

diff --git a/docs/installation.rst b/docs/installation.rst
@@ -303,36 +303,8 @@ See :doc:`pyodide`.
 Enabling Integrated OCR Support
 ---------------------------------------------------------
 
-If you do not intend to use this feature, skip this step. Otherwise, it is required for both installation paths: **from wheels and from sources.**
+PyMuPDF's OCR features rely on the Tesseract OCR engine, which is included in your installation by default together with the English language pack. To enable OCR for other languages, install additional Tesseract language packs as described in :ref:`Tesseract Language Packs <tesseract-language-packs>`, which covers the different platforms.
 
-PyMuPDF will already contain all the logic to support OCR functions. But it additionally does need `Tesseract’s language support data <https://github.com/tesseract-ocr/tessdata>`_.
 
-If not specified explicitly, PyMuPDF will attempt to find the installed
-Tesseract's tessdata, but this should probably not be relied upon.
-
-Otherwise PyMuPDF requires that Tesseract's language support folder is
-specified explicitly either in PyMuPDF OCR functions' `tessdata` arguments or
-`os.environ["TESSDATA_PREFIX"]`.
-
-So for a working OCR functionality, make sure to complete this checklist:
-
-1. Locate Tesseract's language support folder. Typically you will find it here:
-
-   * Windows: `C:/Program Files/Tesseract-OCR/tessdata`
-   * Unix systems: `/usr/share/tesseract-ocr/4.00/tessdata`
-
-2. Specify the language support folder when calling PyMuPDF OCR functions:
-
-   * Set the `tessdata` argument.
-   * Or set `os.environ["TESSDATA_PREFIX"]` from within Python.
-   * Or set environment variable `TESSDATA_PREFIX` before running Python, for example:
-
-     * Windows: `setx TESSDATA_PREFIX "C:/Program Files/Tesseract-OCR/tessdata"`
-     * Unix systems: `declare -x TESSDATA_PREFIX=/usr/share/tesseract-ocr/4.00/tessdata`
-
-
-.. note::
-
-  Find out more on the `official documentation for installing Tesseract website <https://tesseract-ocr.github.io/tessdoc/Installation.html>`_.
 
 .. include:: footer.rst
diff --git a/docs/ocr/tesseract-language-packs.rst b/docs/ocr/tesseract-language-packs.rst
@@ -0,0 +1,251 @@
+
+.. include:: ../header.rst
+
+.. _pymupdf-pro:
+
+.. raw:: html
+
+    <script>
+        document.getElementById("headerSearchWidget").action = '../search.html';
+    </script>
+
+
+.. _tesseract-language-packs:
+
+Tesseract Language Packs
+========================
+
+.. meta::
+   :description: How to install additional Tesseract language packs on macOS, Linux, and Windows.
+
+Overview
+--------
+
+Tesseract identifies most languages by their three-letter `ISO 639-2 <https://en.wikipedia.org/wiki/List_of_ISO_639-2_codes>`_ codes, with a suffix for some variants (for example ``chi_sim`` for Simplified Chinese). English (``eng``) is installed by default on most platforms. For any other language, you need to install the corresponding language pack before pymupdf4llm can use it for OCR.
+
+A full list of supported language codes is available in the `Tesseract tessdata repository <https://github.com/tesseract-ocr/tessdata>`_.
+
+.. tip::
+
+   To see which languages are already installed on your system, run ``tesseract --list-langs`` in your terminal.
+
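The same check can be scripted. The following is a minimal sketch, assuming the ``tesseract`` executable is on ``PATH`` and that ``--list-langs`` prints one header line followed by one language code per line. The helper names ``parse_list_langs`` and ``installed_languages`` are illustrative and not part of PyMuPDF or pymupdf4llm.

```python
from __future__ import annotations

import shutil
import subprocess


def parse_list_langs(output: str) -> list[str]:
    """Extract language codes from the text printed by `tesseract --list-langs`."""
    # Assumption: the first non-empty line is a header such as
    # 'List of available languages in "<path>" (2):'; the rest are codes.
    lines = [line.strip() for line in output.splitlines() if line.strip()]
    return lines[1:]


def installed_languages() -> list[str]:
    """Return installed language codes, or [] if tesseract is not on PATH."""
    exe = shutil.which("tesseract")
    if exe is None:
        return []
    result = subprocess.run(
        [exe, "--list-langs"], capture_output=True, text=True, check=False
    )
    # Fall back to stderr in case the list was not written to stdout.
    return parse_list_langs(result.stdout or result.stderr)
```

An empty list from ``installed_languages()`` means either that the executable was not found or that no language packs were reported.
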
+----
+
+Linux
+-----
+
+Language pack installation varies slightly by distribution.
+
+**Ubuntu / Debian**
+
+.. code-block:: bash
+
+   # List all available language packs
+   apt-cache search tesseract-ocr
+
+   # Install a specific language (e.g. German)
+   sudo apt install tesseract-ocr-deu
+
+   # Install all available languages at once
+   sudo apt install tesseract-ocr-all
+
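+Language packages follow the naming pattern ``tesseract-ocr-<langcode>``, for example ``tesseract-ocr-fra`` for French or ``tesseract-ocr-chi-sim`` for Simplified Chinese. Note that the package name uses a hyphen where the Tesseract language code uses an underscore (``chi_sim``).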
+
+**Fedora / RHEL**
+
+.. code-block:: bash
+
+   # Search for available language packs
+   dnf search tesseract
+
+   # Install a specific language (e.g. German)
+   sudo dnf install tesseract-langpack-deu
+
+   # Install all language packs
+   # (the pattern is quoted so the shell passes it to dnf unexpanded)
+   sudo dnf install 'tesseract-langpack-*'
+
+On Fedora, packages are named ``tesseract-langpack-<langcode>``.
+
+**Arch Linux**
+
+.. code-block:: bash
+
+   # Search for available language packs
+   pacman -Ss tesseract-data
+
+   # Install a specific language (e.g. German)
+   sudo pacman -S tesseract-data-deu
+
+On Arch, packages are named ``tesseract-data-<langcode>``.
+
+Manual Installation (All Distros)
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+If a language pack is not available through your package manager, download the ``.traineddata`` file directly from GitHub and copy it to your Tesseract data directory:
+
+.. code-block:: bash
+
+   # Download language pack (e.g. French)
+   curl -L https://github.com/tesseract-ocr/tessdata/raw/main/fra.traineddata \
+     -o fra.traineddata
+
+   # Copy to tessdata directory (path varies by distro)
+   sudo cp fra.traineddata /usr/share/tesseract-ocr/4.00/tessdata/
+   # or
+   sudo cp fra.traineddata /usr/share/tessdata/
+
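The download step can also be done from Python. This is a rough sketch that reuses the URL pattern from the ``curl`` command above; the function names are hypothetical, and the destination directory is assumed to be writable by the current user (system tessdata directories usually are not).

```python
from __future__ import annotations

import urllib.request
from pathlib import Path

# URL pattern taken from the curl command above.
TESSDATA_URL = "https://github.com/tesseract-ocr/tessdata/raw/main/{code}.traineddata"


def traineddata_url(code: str) -> str:
    """Build the download URL for a language code such as 'fra'."""
    return TESSDATA_URL.format(code=code)


def download_language(code: str, dest_dir: str) -> Path:
    """Download <code>.traineddata into dest_dir and return the file path."""
    target = Path(dest_dir) / f"{code}.traineddata"
    # urlretrieve follows the redirect GitHub issues for raw file URLs.
    urllib.request.urlretrieve(traineddata_url(code), str(target))
    return target
```

Downloading into a user-writable directory avoids ``sudo``. Tesseract then needs to be told about that directory, for example through the ``TESSDATA_PREFIX`` environment variable.
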
+Common tessdata locations on Linux:
+
+.. list-table::
+   :header-rows: 1
+   :widths: 40 60
+
+   * - Distribution
+     - Path
+   * - Ubuntu / Debian
+     - ``/usr/share/tesseract-ocr/4.00/tessdata/`` (the version directory varies, e.g. ``5`` for Tesseract 5)
+   * - Fedora / RHEL
+     - ``/usr/share/tesseract/tessdata/``
+   * - Arch Linux
+     - ``/usr/share/tessdata/``
+
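Because the location differs between systems, a script can probe the usual candidates instead of hard-coding one. The sketch below is illustrative only: ``find_tessdata`` is a hypothetical helper, the candidate list is copied from the table above, and a ``TESSDATA_PREFIX`` environment variable is assumed to take priority when set.

```python
from __future__ import annotations

import os
from pathlib import Path

# Candidate paths copied from the table above; adjust for your system.
CANDIDATE_DIRS = [
    "/usr/share/tesseract-ocr/4.00/tessdata/",
    "/usr/share/tesseract/tessdata/",
    "/usr/share/tessdata/",
]


def find_tessdata(candidates=CANDIDATE_DIRS, env=None) -> Path | None:
    """Return the first existing tessdata directory, or None if none exists."""
    env = os.environ if env is None else env
    prefix = env.get("TESSDATA_PREFIX")
    # An explicitly configured TESSDATA_PREFIX is checked before the guesses.
    paths = ([prefix] if prefix else []) + list(candidates)
    for path in paths:
        if Path(path).is_dir():
            return Path(path)
    return None
```

A ``None`` result suggests that Tesseract data lives in a non-standard location or is not installed.
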
+----
+
+Windows
+-------
+
+During Installation (Recommended)
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+The Tesseract Windows installer from `UB Mannheim <https://github.com/UB-Mannheim/tesseract/wiki>`_ lets you select additional language packs during setup. When you reach the **Choose Components** screen, expand **Additional language data** and tick the languages you need.
+
+After Installation (Manual)
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+If Tesseract is already installed, download language packs manually:
+
+1. Go to `github.com/tesseract-ocr/tessdata <https://github.com/tesseract-ocr/tessdata>`_
+2. Download the ``.traineddata`` file for your language (e.g. ``fra.traineddata`` for French)
+3. Copy the file into your Tesseract ``tessdata`` folder, typically:
+
+.. code-block:: text
+
+   C:\Program Files\Tesseract-OCR\tessdata\
+
+.. note::
+
+   The Chocolatey package (``choco install tesseract``) includes only English. All additional languages must be added manually using the steps above.
+
+Verify the Install
+~~~~~~~~~~~~~~~~~~
+
+Open Command Prompt or PowerShell and run:
+
+.. code-block:: powershell
+
+   tesseract --list-langs
+
+Your newly installed language should appear in the output.
+
+----
+
+macOS
+-----
+
+The recommended approach on macOS is `Homebrew <https://brew.sh>`_. There are two options depending on how much disk space you want to use.
+
+Install All Languages at Once
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+The ``tesseract-lang`` formula bundles Tesseract with every available language pack:
+
+.. code-block:: bash
+
+   brew install tesseract-lang
+
+Install Specific Languages
+~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+If you only need a few languages, install ``tesseract`` first and then manually download the ``.traineddata`` files you need:
+
+.. code-block:: bash
+
+   # Install Tesseract engine only
+   brew install tesseract
+
+   # Find the tessdata directory
+   brew info tesseract
+   # Look for a line like: /opt/homebrew/share/tessdata
+
+   # Download a specific language pack (e.g. French)
+   curl -L https://github.com/tesseract-ocr/tessdata/raw/main/fra.traineddata \
+     -o /opt/homebrew/share/tessdata/fra.traineddata
+
+Replace ``fra`` with your target language code and adjust the tessdata path to match what ``brew info tesseract`` reports on your machine.
+
+.. note::
+
+   If you installed Tesseract via MacPorts instead of Homebrew, use ``sudo port install tesseract-<langcode>``, for example ``sudo port install tesseract-fra``.
+
+----
+
+Using a Language with pymupdf4llm
+----------------------------------
+
+Once a language pack is installed, pass its code to ``to_markdown()`` via the ``ocr_language`` parameter:
+
+.. code-block:: python
+
+   import pymupdf4llm
+
+   # Single language
+   md = pymupdf4llm.to_markdown("document.pdf", ocr_language="fra")
+
+   # Multiple languages
+   md = pymupdf4llm.to_markdown("document.pdf", ocr_language="eng+fra+deu")
+
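When the language list comes from configuration or user input, it can help to check it before building the ``+``-joined string. The following is a small sketch under that assumption; ``build_ocr_language`` is a hypothetical helper, not a pymupdf4llm function, and the list of installed languages is hard-coded here for illustration.

```python
from __future__ import annotations


def build_ocr_language(codes: list[str], installed: list[str]) -> str:
    """Join language codes with '+' after checking that each one is installed."""
    missing = [code for code in codes if code not in installed]
    if missing:
        raise ValueError(f"Language packs not installed: {', '.join(missing)}")
    return "+".join(codes)


# Example with a hard-coded list of installed languages.
print(build_ocr_language(["eng", "fra"], installed=["eng", "fra", "deu"]))
# prints: eng+fra
```

The resulting string can be passed as the ``ocr_language`` argument shown above.
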
+----
+
+Common Language Codes
+---------------------
+
+.. list-table::
+   :header-rows: 1
+   :widths: 50 50
+
+   * - Language
+     - Code
+   * - English
+     - ``eng``
+   * - French
+     - ``fra``
+   * - German
+     - ``deu``
+   * - Spanish
+     - ``spa``
+   * - Italian
+     - ``ita``
+   * - Portuguese
+     - ``por``
+   * - Simplified Chinese
+     - ``chi_sim``
+   * - Traditional Chinese
+     - ``chi_tra``
+   * - Japanese
+     - ``jpn``
+   * - Korean
+     - ``kor``
+   * - Arabic
+     - ``ara``
+   * - Russian
+     - ``rus``
+   * - Hindi
+     - ``hin``
+
+For the full list of supported languages and their codes, see the `Tesseract tessdata repository <https://github.com/tesseract-ocr/tessdata>`_.
+
+
+
+.. include:: ../footer.rst
+
+