fix: resolve TESSDATA_PREFIX path correctly for all Tesseract versions #2257
DhanushVarma-2 wants to merge 1 commit into CCExtractor:master
Two bugs in init_ocr() in ocr.c:

1. The Tesseract 4/5 branch always blindly appended '/tessdata' to the path returned by probe_tessdata_location(). If TESSDATA_PREFIX was already set to a path ending in 'tessdata/', this caused a double-append, e.g. '/usr/share/tessdata/tessdata'.

2. The legacy Tesseract <4 branch passed tessdata_path raw to TessBaseAPIInit4 without appending 'tessdata' at all, causing Tesseract to look for eng.traineddata directly in e.g. '/usr/share/' instead of '/usr/share/tessdata/'.

Fix: normalize the path once before both branches. Detect whether the returned path already ends with 'tessdata' or 'tessdata/', handle Windows backslash separators, and use the resolved path in both Tesseract version branches. Add mprint diagnostic for the resolved path.

Fixes CCExtractor#1492
The format_rust CI failures are pre-existing on master.
CCExtractor CI platform finished running the test files on linux. Below is a summary of the test results, when compared to test for commit 395f9b3...:
Your PR breaks these cases:
NOTE: The following tests have been failing on the master branch as well as the PR:
Congratulations: Merging this PR would fix the following tests:
It seems that not all tests were passed completely. This is an indication that the output of some files is not as expected (but might be according to you). Check the result page for more info.
CCExtractor CI platform finished running the test files on windows. Below is a summary of the test results, when compared to test for commit 395f9b3...:
Your PR breaks these cases:
NOTE: The following tests have been failing on the master branch as well as the PR:
Congratulations: Merging this PR would fix the following tests:
It seems that not all tests were passed completely. This is an indication that the output of some files is not as expected (but might be according to you). Check the result page for more info.
There were two bugs in init_ocr():
1. The Tesseract 4/5 branch always appended /tessdata to the probed path. If TESSDATA_PREFIX already pointed at the tessdata directory itself, the path was doubled: /usr/share/tessdata/tessdata.
2. The legacy Tesseract <4 branch passed the raw probed path to TessBaseAPIInit4 with no /tessdata appended, so Tesseract looked for /usr/share/eng.traineddata instead of /usr/share/tessdata/eng.traineddata. This is the exact error reported in #1492.
Fix: build the tessdata path once, before both branches. If the probed path already ends with tessdata, use it as is; otherwise append tessdata. Windows backslash separators are handled too. Both branches now use the same resolved path, and a new mprint line prints it to make future debugging easier.
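
For reviewers who want to reason about the normalization without opening the diff, here is a minimal sketch of the idea in C. It is not the code from this PR: the helper names ends_with_tessdata and resolve_tessdata_dir are hypothetical, and the choices to append with a forward slash on every platform and to reject names such as "mytessdata" are assumptions made for illustration.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical helper (not the PR's actual code): returns 1 if `path`
 * already names a tessdata directory, ignoring trailing separators. */
static int ends_with_tessdata(const char *path)
{
    static const char name[] = "tessdata";
    const size_t name_len = sizeof(name) - 1;
    size_t len = strlen(path);

    /* Skip trailing '/' or '\\' so a path ending in "tessdata/" matches. */
    while (len > 0 && (path[len - 1] == '/' || path[len - 1] == '\\'))
        len--;
    if (len < name_len || strncmp(path + len - name_len, name, name_len) != 0)
        return 0;
    /* Assumption: require a separator (or start of string) before the
     * name, so "/opt/mytessdata" is not mistaken for a tessdata dir. */
    return len == name_len ||
           path[len - name_len - 1] == '/' ||
           path[len - name_len - 1] == '\\';
}

/* Hypothetical helper: writes the resolved tessdata directory to `out`.
 * Returns 0 on success, -1 if `out` is too small. */
static int resolve_tessdata_dir(const char *base, char *out, size_t out_len)
{
    size_t len = strlen(base);
    /* An empty base is treated as "no separator needed". */
    int has_sep = len == 0 || base[len - 1] == '/' || base[len - 1] == '\\';
    int n;

    if (ends_with_tessdata(base))
        n = snprintf(out, out_len, "%s", base);          /* use as is */
    else
        n = snprintf(out, out_len, "%s%stessdata", base, /* append once */
                     has_sep ? "" : "/");
    return (n < 0 || (size_t)n >= out_len) ? -1 : 0;
}
```

With this sketch, a probed path of /usr/share resolves to /usr/share/tessdata, while /usr/share/tessdata/ is returned unchanged, so neither of the two bugs described above can occur. Calling the helper once and passing its result to both version branches mirrors the "resolve once" structure the fix describes.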