Tesseract problem in self-compiled Sikuli

Asked by j

I have exactly the problem described here: https://bugs.launchpad.net/sikuli/+bug/702625

As soon as I run my compiled sikuli-ide and try to capture an image, Sikuli crashes and the following message appears in the shell: "Unable to load unicharset file /tmp/sikuli/tessdata/eng.unicharset". As described in the bug report, the file exists but is empty.
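A quick way to see which files in the temporary tessdata directory are empty is a small script like the following. This is only a diagnostic sketch: the function name is made up for illustration, and the path is the one from the error message, which may differ if java.io.tmpdir points elsewhere.

```python
import os

def find_empty_files(directory):
    """Return the sorted names of zero-byte regular files directly inside `directory`."""
    empty = []
    for name in sorted(os.listdir(directory)):
        path = os.path.join(directory, name)
        if os.path.isfile(path) and os.path.getsize(path) == 0:
            empty.append(name)
    return empty

if __name__ == "__main__":
    # Path taken from the error message; adjust it if your temp dir differs.
    tessdata = "/tmp/sikuli/tessdata"
    if os.path.isdir(tessdata):
        print(find_empty_files(tessdata))
    else:
        print("not found:", tessdata)
```

Running the same function against the installed data directory shows whether the files were already empty before Sikuli copied them.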

My tesseract data directory is "/usr/local/share/tessdata". It contains many files such as eng.word-dawg and eng.inttemp, plus the directories "configs" and "tessconfigs". I am using Tesseract 2.04, which I compiled myself; "make install" created this directory and its contents.

The path is already listed in cmake_modules/common.cmake:

FIND_PATH(TESSERACT_DATA_DIR confsets
   "/opt/local/share/tessdata"
   "/usr/local/share/tessdata"
   "/usr/share/tesseract-ocr/tessdata"
   "/usr/share/tesseract/tessdata"
   "/usr/share/tessdata"
   "c:/tesseract-2.04/tessdata"
)
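To see which of these directories such a search would pick on a given machine, the lookup can be approximated as below. This is a rough stand-in, not CMake itself: the real FIND_PATH command also consults CMake's default search locations and cache, and the helper name is invented for this example.

```python
import os

# Candidate directories copied from the FIND_PATH call above.
CANDIDATE_DIRS = [
    "/opt/local/share/tessdata",
    "/usr/local/share/tessdata",
    "/usr/share/tesseract-ocr/tessdata",
    "/usr/share/tesseract/tessdata",
    "/usr/share/tessdata",
    "c:/tesseract-2.04/tessdata",
]

def first_dir_containing(filename, candidates):
    """Return the first directory in `candidates` that contains `filename`, or None."""
    for directory in candidates:
        if os.path.exists(os.path.join(directory, filename)):
            return directory
    return None

if __name__ == "__main__":
    # "confsets" is the file name the FIND_PATH call looks for.
    print(first_dir_containing("confsets", CANDIDATE_DIRS))
```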

The linked bug report suggests deleting "sikuli-script/target/jar/tessdata", but I do not have this directory; the whole "/jar/" directory is missing.

When I compile Sikuli-script, the cmake output looks like this:

Tesseract-OCR Data Path: /usr/local/share/tessdata
-- checking for module 'opencv'
-- found opencv, version 2.1.0
-- Found JNI: /usr/lib/jvm/default-java/jre/lib/i386/libjawt.so
-- Found SWIG: /usr/local/bin/swig (found version "1.3.40")
-- checking for module 'Tesseract'
-- package 'Tesseract' not found
-- Found Tesseract
NATIVE_LIBS: VisionProxy;VDictProxy

And later:

Scanning dependencies of target sikuli-script.jar.tessdata-in-jar
[ 65%] Copying Tesseract Data
[ 65%] Built target sikuli-script.jar.tessdata-in-jar

After that, in Sikuli-ide, it looks like this:

Tesseract-OCR Data Path: /usr/local/share/tessdata
-- Found Java: /usr/bin/java (found suitable version "1.6.0.29", required is "1.6")

Do you have any idea what I'm doing wrong and how to fix it?

Thanks in advance

Question information

Language: English
Status: Solved
For: SikuliX
Assignee: No assignee
Solved by: j

RaiMan (raimund-hocke) said :
#1

Does your built sikuli-script.jar contain a tessdata folder?

Did you follow the recommended build approach?
- create a folder named build in src/sikuli-script
- inside that folder, run: cmake ..

If anything goes wrong, delete the build folder and start over after making your modifications.
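To answer the first question without unpacking the jar by hand, something like the following can list what the jar holds under tessdata, together with the file sizes. It is a sketch that assumes the data sits under a top-level "tessdata/" entry in the jar; the function name is illustrative.

```python
import zipfile

def tessdata_entries(jar_path, prefix="tessdata/"):
    """Return (name, uncompressed size) for every file under `prefix` in the jar."""
    with zipfile.ZipFile(jar_path) as jar:
        return [
            (info.filename, info.file_size)
            for info in jar.infolist()
            if info.filename.startswith(prefix) and not info.is_dir()
        ]

# Example use (the path depends on where your build puts the jar):
#   for name, size in tessdata_entries("sikuli-script.jar"):
#       print(size, name)
```

Entries with a size of 0 would point to data that was already empty when the jar was built.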

j (j-the-k) said :
#2

I built everything using this script: https://github.com/sikuli/sikuli/blob/develop/cleanbuildall.sh
The sikuli-script.jar located in /sikuli-script/target/ contains a tessdata folder with the files from /usr/local/share/tessdata in it.

I also found the directory sikuli-script/build/jar, which contains the tessdata directory as well. So the files are included.

I deleted the build directories after every change, but had no luck.

RaiMan (raimund-hocke) said :
#3

Looking into the code again, there are 3 steps during the initialization phase:

Step 1: copy the tessdata directory from inside the jar to the Java temp directory (path: System.getProperty("java.io.tmpdir") + "/sikuli/" + "tessdata") (in TextRecognizer.java, in init())

Step 2: set the environment variable TESSDATA_PREFIX to this path

Step 3: initialize Tesseract with TessBaseAPI::InitWithLanguage(datapath, outputbase, lang, NULL, numeric_mode, 0, 0), where datapath contains the above TESSDATA_PREFIX

BTW: this path is also saved in Settings.OcrDataPath during step 1

The error you mention occurs in step 3, inside Tesseract.
So I guess that one of these steps produces inconsistent results that are not checked by the Sikuli code.
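The kind of check that seems to be missing between step 1 and step 3 could look roughly like this. It is a Python stand-in for illustration only, not Sikuli's Java code: the function name is invented, and the source is a plain directory here rather than a folder inside the jar.

```python
import os
import shutil

def stage_tessdata(source_dir, temp_root):
    """Copy tessdata files into <temp_root>/sikuli/tessdata and report suspicious copies.

    Returns (target_dir, problems), where problems lists files that ended up
    empty or whose size differs from the source file.
    """
    target = os.path.join(temp_root, "sikuli", "tessdata")
    os.makedirs(target, exist_ok=True)
    problems = []
    for name in sorted(os.listdir(source_dir)):
        src = os.path.join(source_dir, name)
        if not os.path.isfile(src):
            continue  # subdirectories such as "configs" are skipped in this sketch
        dst = os.path.join(target, name)
        shutil.copyfile(src, dst)  # step 1: copy to the temp dir
        size = os.path.getsize(dst)
        if size == 0 or size != os.path.getsize(src):
            problems.append(name)
    # Step 2 as described above: point TESSDATA_PREFIX at the copied directory.
    os.environ["TESSDATA_PREFIX"] = target
    return target, problems
```

If problems is non-empty, it would be safer to stop with a clear message than to let step 3 fail inside Tesseract.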

I have to leave now ;-)

j (j-the-k) said :
#4

So after some time I tried again to get tesseract working, and this time I succeeded.
I should have read the whole tesseract README.

Tesseract did not work because I had not downloaded the tesseract language file (tesseract-2.00.eng.tar.gz), so the language data was missing.
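For anyone who wants to verify this before building Sikuli, a check along these lines reports language files that are missing or empty. The list of file suffixes is my assumption of what the Tesseract 2.x English archive contains; compare it with the contents of the archive you actually downloaded. The function name is illustrative.

```python
import os

# Assumed set of per-language files for Tesseract 2.x (verify against your archive).
EXPECTED_SUFFIXES = [
    "DangAmbigs", "freq-dawg", "inttemp", "normproto",
    "pffmtable", "unicharset", "user-words", "word-dawg",
]

def missing_language_files(tessdata_dir, lang="eng"):
    """Return the expected language files that are absent or zero bytes long."""
    missing = []
    for suffix in EXPECTED_SUFFIXES:
        name = f"{lang}.{suffix}"
        path = os.path.join(tessdata_dir, name)
        if not os.path.isfile(path) or os.path.getsize(path) == 0:
            missing.append(name)
    return missing

if __name__ == "__main__":
    tessdata = "/usr/local/share/tessdata"
    if os.path.isdir(tessdata):
        print(missing_language_files(tessdata))
```

An empty list means every expected file is present and non-empty.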

This page briefly explains how to set up tesseract correctly: http://www.win.tue.nl/~aeb/linux/ocr/tesseract.html

Maybe this helps someone with the same problem in the future. Thanks anyway, RaiMan.