fix: resolve TESSDATA_PREFIX path correctly for all Tesseract versions #2257

Open
DhanushVarma-2 wants to merge 1 commit into CCExtractor:master from DhanushVarma-2:fix/tessdata-path-2

Contributor

DhanushVarma-2 commented Apr 4, 2026

There were two bugs in init_ocr():

1. The Tesseract 4/5 branch always appended /tessdata to the probed path. If TESSDATA_PREFIX already pointed at the tessdata directory itself, the suffix was doubled: /usr/share/tessdata/tessdata.
2. The legacy Tesseract <4 branch passed the raw probed path to TessBaseAPIInit4 without appending /tessdata, so Tesseract looked for /usr/share/eng.traineddata instead of /usr/share/tessdata/eng.traineddata. This is the exact error reported in #1492.

Fix: build the tessdata path once, before both branches. If the probed path already ends with tessdata, it is used as is; otherwise tessdata is appended. Windows backslash separators are handled too. Both branches now use the same resolved path. An mprint line showing the resolved path was added to make future debugging easier.
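
For readers who want to see the normalization in isolation, here is a minimal sketch of the idea. It is not the code from this PR: the helper name `resolve_tessdata_dir`, its signature, and the choice to always append with a forward slash are assumptions made for illustration only.

```c
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper (not taken from ocr.c): given the path returned by
 * the probe step, write the directory expected to hold *.traineddata
 * files into `out`.
 * Returns 0 on success, -1 on invalid input or if `out` is too small.
 */
static int resolve_tessdata_dir(const char *probed, char *out, size_t out_size)
{
    static const char leaf[] = "tessdata";
    const size_t leaf_len = sizeof(leaf) - 1;
    size_t len;
    int already_tessdata = 0;
    int n;

    if (probed == NULL || out == NULL || out_size == 0 || probed[0] == '\0')
        return -1;

    /* Ignore trailing separators, both POSIX '/' and Windows '\\'. */
    len = strlen(probed);
    while (len > 0 && (probed[len - 1] == '/' || probed[len - 1] == '\\'))
        len--;

    /* Does the last path component equal "tessdata" exactly? */
    if (len >= leaf_len &&
        strncmp(probed + len - leaf_len, leaf, leaf_len) == 0) {
        if (len == leaf_len) {
            already_tessdata = 1;
        } else {
            char before = probed[len - leaf_len - 1];
            already_tessdata = (before == '/' || before == '\\');
        }
    }

    if (already_tessdata)
        n = snprintf(out, out_size, "%.*s", (int)len, probed);
    else /* Assumption: '/' is acceptable as the joining separator. */
        n = snprintf(out, out_size, "%.*s/%s", (int)len, probed, leaf);

    /* snprintf reports the length it wanted; treat truncation as failure. */
    if (n < 0 || (size_t)n >= out_size)
        return -1;
    return 0;
}
```

With this sketch, `/usr/share`, `/usr/share/tessdata` and `/usr/share/tessdata/` all resolve to `/usr/share/tessdata`, while a directory such as `/opt/mytessdata` is not mistaken for a tessdata directory because the match must cover a whole path component. Both Tesseract branches could then be handed the same resolved string.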

Two bugs in init_ocr() in ocr.c:

1. The Tesseract 4/5 branch always blindly appended '/tessdata' to the
   path returned by probe_tessdata_location(). If TESSDATA_PREFIX was
   already set to a path ending in 'tessdata/', this caused a double-
   append e.g. '/usr/share/tessdata/tessdata'.

2. The legacy Tesseract <4 branch passed tessdata_path raw to
   TessBaseAPIInit4 without appending 'tessdata' at all, causing
   Tesseract to look for eng.traineddata directly in e.g. '/usr/share/'
   instead of '/usr/share/tessdata/'.

Fix: normalize the path once before both branches. Detect whether the
returned path already ends with 'tessdata' or 'tessdata/', handle
Windows backslash separators, and use the resolved path in both
Tesseract version branches. Add mprint diagnostic for the resolved path.

Fixes CCExtractor#1492
Contributor Author

DhanushVarma-2 commented Apr 6, 2026

The format_rust CI failures are pre-existing on master.
src/rust/lib_ccxr/src/teletext.rs has a formatting issue unrelated to this PR. No files in src/rust/ were modified in this change.

@ccextractor-bot
Collaborator

CCExtractor CI platform finished running the test files on linux. Below is a summary of the test results, when compared to test for commit 395f9b3...:
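
| Report Name | Tests Passed |
| --- | --- |
| Broken | 9/13 |
| CEA-708 | 1/14 |
| DVB | 3/7 |
| DVD | 3/3 |
| DVR-MS | 2/2 |
| General | 20/27 |
| Hardsubx | 1/1 |
| Hauppage | 3/3 |
| MP4 | 3/3 |
| NoCC | 10/10 |
| Options | 77/86 |
| Teletext | 20/21 |
| WTV | 13/13 |
| XDS | 31/34 |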

Your PR breaks these cases:

  • ccextractor --autoprogram --out=ttxt --latin1 --ucla --xds 8e8229b88b...
  • ccextractor --autoprogram --out=srt --latin1 --quant 0 85271be4d2...
  • ccextractor --autoprogram --out=ttxt --latin1 132d7df7e9...
  • ccextractor --autoprogram --out=ttxt --latin1 99e5eaafdc...
  • ccextractor --autoprogram --out=srt --latin1 b22260d065...
  • ccextractor --autoprogram --out=ttxt --latin1 --ucla 7aad20907e...
  • ccextractor --autoprogram --out=ttxt --latin1 --ucla dab1c1bd65...
  • ccextractor --autoprogram --out=ttxt --latin1 01509e4d27...
  • ccextractor --out=srt --latin1 --autoprogram 29e5ffd34b...
  • ccextractor --out=spupng c83f765c66...
  • ccextractor --startcreditstext "CCextractor Start crdit Testing" c4dd893cb9...
  • ccextractor --startcreditsnotbefore 1 --startcreditstext "CCextractor Start crdit Testing" c4dd893cb9...
  • ccextractor --startcreditsforatleast 1 --startcreditstext "CCextractor Start crdit Testing" c4dd893cb9...
  • ccextractor --autoprogram --out=ttxt --xds --latin1 --ucla 85058ad37e...
  • ccextractor --autoprogram --out=srt --latin1 --ucla b22260d065...
  • ccextractor --autoprogram --out=ttxt --latin1 --ucla --xds 7f41299cc7...

NOTE: The following tests have been failing on the master branch as well as the PR:

Congratulations: Merging this PR would fix the following tests:

  • ccextractor --startcreditsnotafter 2 --startcreditstext "CCextractor Start crdit Testing" c4dd893cb9..., Last passed: Never
  • ccextractor --startcreditsforatmost 2 --startcreditstext "CCextractor Start crdit Testing" c4dd893cb9..., Last passed: Never

It seems that not all tests were passed completely. This is an indication that the output of some files is not as expected (but might be according to you).

Check the result page for more info.

@ccextractor-bot
Collaborator

CCExtractor CI platform finished running the test files on windows. Below is a summary of the test results, when compared to test for commit 395f9b3...:
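
| Report Name | Tests Passed |
| --- | --- |
| Broken | 9/13 |
| CEA-708 | 1/14 |
| DVB | 4/7 |
| DVD | 3/3 |
| DVR-MS | 2/2 |
| General | 22/27 |
| Hardsubx | 1/1 |
| Hauppage | 3/3 |
| MP4 | 3/3 |
| NoCC | 10/10 |
| Options | 78/86 |
| Teletext | 20/21 |
| WTV | 13/13 |
| XDS | 31/34 |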

Your PR breaks these cases:

NOTE: The following tests have been failing on the master branch as well as the PR:

Congratulations: Merging this PR would fix the following tests:

  • ccextractor --autoprogram --out=srt --latin1 --quant 0 85271be4d2..., Last passed: Never
  • ccextractor --autoprogram --out=ttxt --latin1 --ucla dab1c1bd65..., Last passed: Never
  • ccextractor --out=srt --latin1 --autoprogram 29e5ffd34b..., Last passed: Never
  • ccextractor --out=spupng c83f765c66..., Last passed: Never
  • ccextractor --startcreditstext "CCextractor Start crdit Testing" c4dd893cb9..., Last passed: Never
  • ccextractor --startcreditsnotbefore 1 --startcreditstext "CCextractor Start crdit Testing" c4dd893cb9..., Last passed: Never
  • ccextractor --startcreditsnotafter 2 --startcreditstext "CCextractor Start crdit Testing" c4dd893cb9..., Last passed: Never
  • ccextractor --startcreditsforatleast 1 --startcreditstext "CCextractor Start crdit Testing" c4dd893cb9..., Last passed: Never
  • ccextractor --startcreditsforatmost 2 --startcreditstext "CCextractor Start crdit Testing" c4dd893cb9..., Last passed: Never

It seems that not all tests were passed completely. This is an indication that the output of some files is not as expected (but might be according to you).

Check the result page for more info.
