Tesseract with CJK and 1.1.1

Asked by Barry Janzen

I've seen questions about text entry in Chinese, Japanese and Korean, but my case is different: my tests are written in English and look up the localized text in order to click menus in other languages, and they fail in these languages. For example, to click on "File" in Japanese, I look up "File"="ファイル" and then do a

click('ファイル')

which doesn't work. With Tesseract 3, it should, correct? Or do I need to build it myself?
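The lookup described above could be sketched roughly as follows. MENU_LABELS and localized() are hypothetical names used only for illustration and are not part of SikuliX; the only data taken from the question is the "File"/"ファイル" pair.

```python
# -*- coding: utf-8 -*-
# Hypothetical lookup table: English label -> {language code -> localized label}.
MENU_LABELS = {
    u"File": {u"jpn": u"ファイル"},
}


def localized(label, lang):
    """Return the localized label, falling back to the English one."""
    return MENU_LABELS.get(label, {}).get(lang, label)


# Inside a SikuliX script the result would then be passed to click(), e.g.:
#   click(localized(u"File", u"jpn"))
```

Falling back to the English label keeps the same test usable when no translation has been entered yet.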

Question information

Language: English
Status: Answered
For: SikuliX

Barry Janzen (barry-janzen) said :
#1

OK, I got a bit closer, looking at https://answers.launchpad.net/sikuli/+faq/2709

Some things have changed since then: the code.google.com link has moved to GitHub at https://github.com/tesseract-ocr/tessdata.

I downloaded the zip, then didn't know where to put the .traineddata files I wanted, so I ran some diagnostics and found my eng.traineddata is located at ~/Library/Application Support/Sikulix/SikulixTesseract/tessdata/eng.traineddata
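The post does not say which diagnostics were run. One possible way to locate installed language files is a small search like the sketch below; find_traineddata() is a hypothetical helper, and the starting directory is simply the one reported above.

```python
import os


def find_traineddata(root):
    """Return the sorted paths of all *.traineddata files under root."""
    found = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if name.endswith(".traineddata"):
                found.append(os.path.join(dirpath, name))
    return sorted(found)


if __name__ == "__main__":
    # Directory taken from the post above; adjust it for your own installation.
    root = os.path.expanduser("~/Library/Application Support/Sikulix")
    for path in find_traineddata(root):
        print(path)
```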

I put the jpn.traineddata file from GitHub in that directory and added the following to my script:

import org.sikuli.script.TextRecognizer as TR
Settings.OcrTextSearch = True
Settings.OcrTextRead = True
Settings.OcrLanguage = "jpn"
TR.reset()

Whichever language setting I try, such as Mac's "ja", Locale "ja_JA", or tesseract's "jpn", I get:

Failed loading language 'jpn'
Tesseract couldn't load any languages!

So I'm getting closer, but not quite there.

RaiMan (raimund-hocke) said :
#2

Yes, you found the approach that is currently possible; it should work.

I will check it.

Barry Janzen (barry-janzen) said :
#3

One other data point: when I used the files from GitHub as is, Sikuli crashed with this error:

read_params_file: parameter not found: allow_blob_division

So I commented out line 47 in jpn.traineddata:

allow_blob_division F

And that allowed me to run, so it tells me it's finding and opening jpn.traineddata, just not getting it loaded.

Barry Janzen (barry-janzen) said :
#4

In Sikuli, I get the "Can't load any languages" if my script does

Settings.OcrTextSearch = True
Settings.OcrTextRead = True
Settings.OcrLanguage = 'jpn'
TR.reset()

So I used brew to install tesseract and see if I could get it to read a png image of the web page https://news.google.com/news?ned=jp. After FAILED attempts that looked like

tesseract -l jpn /tmp/jpn.png /tmp/jpn-out

I read somewhere that the TESSDATA_PREFIX was the critical piece. So I copied my jpn.traineddata to the right spot and ran:

export TESSDATA_PREFIX=/usr/local/Cellar/tesseract/3.04.00/share/;./tesseract -l jpn /tmp/jpn.png /tmp/jpn-out

and it WORKED! (Aside: in my .bash_profile, TESSDATA_PREFIX was set to Sikuli's tesseract, so I overrode it on the command line.) So back to Sikuli. I tried running my script in a couple of iterations, using both the Sikuli tessdata directory and the brew tessdata directory.
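The same command-line override can be reproduced from Python without touching .bash_profile. This is only a sketch: tesseract_invocation() is a hypothetical helper, the argument order mirrors the command shown above, and it assumes a tesseract executable is available on the PATH.

```python
import os


def tesseract_invocation(image, out_base, lang, prefix, exe="tesseract"):
    """Build the argument list and environment for one tesseract run.

    TESSDATA_PREFIX is overridden in a copy of the environment, so the
    value exported by the shell profile is left untouched.
    """
    cmd = [exe, "-l", lang, image, out_base]
    env = dict(os.environ, TESSDATA_PREFIX=prefix)
    return cmd, env


# Usage (requires an installed tesseract):
#   import subprocess
#   cmd, env = tesseract_invocation("/tmp/jpn.png", "/tmp/jpn-out", "jpn",
#                                   "/usr/local/Cellar/tesseract/3.04.00/share/")
#   subprocess.call(cmd, env=env)
```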

As soon as I add the

Settings.OcrLanguage = 'jpn'

in my script, it throws the "Tesseract couldn't load any languages!" error, which I can reproduce in the brew install if I give it an invalid TESSDATA_PREFIX directory. For example:

export TESSDATA_PREFIX=/tmp;./tesseract -l jpn /tmp/jpn.png /tmp/jpn-out

Please make sure the TESSDATA_PREFIX environment variable is set to the parent directory of your "tessdata" directory.
Failed loading language 'jpn'
Tesseract couldn't load any languages!
Could not initialize tesseract.

Since it DOES work in English, it must mean that we are getting the directory. For example, if I use the invalid directory with brew tesseract in English, it complains:

export TESSDATA_PREFIX=/tmp;tesseract /tmp/tess-test.png /tmp/tess-out2

Please make sure the TESSDATA_PREFIX environment variable is set to the parent directory of your "tessdata" directory.
Failed loading language 'eng'
Tesseract couldn't load any languages!
Could not initialize tesseract.

So we have the right directory. It's just that Settings.OcrLanguage = 'jpn' is not doing the right thing.
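The directory rule quoted in the error message can be checked before a script runs. The sketch below assumes the layout stated in that message, where TESSDATA_PREFIX is the parent of a "tessdata" directory; has_language() is a hypothetical helper, not a Tesseract or SikuliX function.

```python
import os


def has_language(prefix, lang):
    """True if <prefix>/tessdata/<lang>.traineddata exists as a file."""
    path = os.path.join(prefix, "tessdata", lang + ".traineddata")
    return os.path.isfile(path)


if __name__ == "__main__":
    prefix = os.environ.get("TESSDATA_PREFIX", "")
    for lang in ("eng", "jpn"):
        print(lang, has_language(prefix, lang))
```

This only confirms that the file is present. As the posts above show, a file can be found and still fail to load.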

Hope this helps.

katsutoshi inuga (k.inuga) said :
#5

> So I commented out line 47 in jpn.traineddata:

No. You should use the combine_tessdata command instead:

1. combine_tessdata -e jpn.traineddata jpn.config
2. Edit jpn.config (comment out line 47)
3. combine_tessdata -o jpn.traineddata jpn.config

After that, you can use jpn.traineddata.

It worked in my environment.
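The editing step between the two combine_tessdata calls could be scripted along these lines. This is a sketch under assumptions: comment_out_line() is a hypothetical helper, and it assumes that a leading "#" marks a comment in a Tesseract config file. The line number 47 comes from the posts above and may differ in other versions of the file.

```python
def comment_out_line(text, lineno):
    """Prefix the given 1-based line with '# ' and return the new text."""
    lines = text.splitlines(True)  # keep the original line endings
    if not 1 <= lineno <= len(lines):
        raise ValueError("line number out of range")
    lines[lineno - 1] = "# " + lines[lineno - 1]
    return "".join(lines)


# Usage, after extracting jpn.config with "combine_tessdata -e":
#   with open("jpn.config") as f:
#       text = f.read()
#   with open("jpn.config", "w") as f:
#       f.write(comment_out_line(text, 47))
```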
