Adds more comprehensive information about OCR.
JorjMcKie
left a comment
There is one minor comment about a misleading formulation ...
| md_text = pymupdf4llm.to_markdown("multilingual.pdf", | ||
| ocr_language="eng+deu") | ||
| Tesseract language packs must be installed separately on your system. For example, on Ubuntu: |
Possibly remove the word "separately" from this sentence?
| The default plugins are designed to be used as is, without any need for configuration. However, if you want to use a specific plugin, you can do so by using the following approach (which enforces for instance using RapidOCR and skipping above selection process). Please note that all plugins have a function named `exec_ocr` that does the actual OCR:: | ||
| The default plugins are designed to be used as is, without any need for configuration. | ||
| However, if you want to use a specific plugin, you can do so by using the following approach (which enforces for instance using RapidOCR and skipping above selection process). Please note that all plugins have a function named `exec_ocr` that does the actual OCR. |
Very minor, but suggest changing `skipping above selection process` to `skipping the above selection process`.
| If you want to use both OCR engines side-by-side, you can do so by implementing a custom OCR function which calls both OCR engines — one for bbox recognition (RapidOCR) and the other for text recognition (Tesseract) — and then combines their results. | ||
| This pre-made callable OCR function can be found in the ``pymupdf4llm.ocr`` module as ``rapidtess_api.exec_ocr``. |
Change `rapidtess_api.exec_ocr` to `rapidtess_api.exec_ocr()`? Otherwise it doesn't look like a function.
| --------------------------------------------------------- | ||
| If you do not intend to use this feature, skip this step. Otherwise, it is required for both installation paths: **from wheels and from sources.** | ||
| PyMuPDF's OCR features rely on the Tesseract OCR engine which is included by default in your installation. It includes the English language pack by default. To install additional Tesseract language packs to enable OCR for languages other than English, see :ref:`Tesseract Language Packs <tesseract-language-packs>` for instructions on how to do this on different platforms. |
I'm confused: this seems to suggest that the pymupdf4llm wheel contains the Tesseract English language pack, which I don't think is the case?
In practice pymupdf uses various ways to locate Tesseract language packs, including running Tesseract command-line programmes, and I don't understand why this PR removes information about this.
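For illustration, here is a rough sketch of the kind of lookup meant above. It is not PyMuPDF's actual implementation, and the helper names (`tessdata_from_env`, `parse_list_langs`, `find_tesseract_languages`) are hypothetical. It assumes the standard `TESSDATA_PREFIX` environment variable and the `tesseract --list-langs` command-line option, and that header lines in the command's output start with "List of":

```python
import os
import shutil
import subprocess


def tessdata_from_env():
    """Return the TESSDATA_PREFIX folder if it is set and exists, else None."""
    path = os.environ.get("TESSDATA_PREFIX")
    if path and os.path.isdir(path):
        return path
    return None


def parse_list_langs(output):
    """Extract language codes from the text printed by `tesseract --list-langs`.

    Assumption: header lines start with "List of"; every other non-empty
    line is taken to be one language code.
    """
    langs = []
    for line in output.splitlines():
        line = line.strip()
        if not line or line.startswith("List of"):
            continue
        langs.append(line)
    return langs


def find_tesseract_languages():
    """Ask the tesseract CLI which language packs it can see.

    Returns a list of language codes, or None if tesseract is not on PATH
    or the call fails.
    """
    exe = shutil.which("tesseract")
    if exe is None:
        return None
    try:
        result = subprocess.run(
            [exe, "--list-langs"],
            capture_output=True,
            text=True,
            check=True,
            timeout=30,
        )
    except (OSError, subprocess.SubprocessError):
        return None
    # Some tesseract versions may print the list to stderr, so scan both.
    return parse_list_langs(result.stdout + "\n" + result.stderr)


if __name__ == "__main__":
    print("TESSDATA_PREFIX folder:", tessdata_from_env())
    print("Languages seen by tesseract:", find_tesseract_languages())
```

If the docs keep a short description of this lookup order, users can work out why a language such as `deu` is or is not found on their system.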
Adds more comprehensive information about OCR and PyMuPDF4LLM.