New (Norwegian) Tesseract training set crashes Sikuli?

Asked by Audun Mathias Øygard

Hi,

I'm having some problems getting a Tesseract training set for Norwegian to work in Sikuli.

The training set was created for Tesseract 2.04 as described here:
http://code.google.com/p/tesseract-ocr/wiki/TrainingTesseract2

My training set works with Tesseract itself, but when I replace the English training set in sikuli-script.jar with mine, Sikuli crashes whenever I try to capture an image or read the text in an image. Since my training set includes non-English characters (æ, ø, å), I was wondering whether this is the reason Sikuli crashes. Or is there another "proper" way of doing it?

The files I've replaced (with identically named files) are:
/tessdata/eng.freq-dawg
/tessdata/eng.inttemp
/tessdata/eng.normproto
/tessdata/eng.pffmtable
/tessdata/eng.unicharset
/tessdata/eng.user-words
/tessdata/eng.word-dawg
/tessdata/eng.DangAmbigs
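
For anyone repeating the swap described above, here is a minimal Python sketch that writes a new jar in which the tessdata entries are replaced by same-named files from a local directory. The function name `swap_tessdata`, its parameters, and the paths in the usage comment are hypothetical; it assumes the jar is an ordinary zip archive with the entries stored under `tessdata/`. It does nothing about the crash itself.

```python
import zipfile
from pathlib import Path


def swap_tessdata(jar_in, jar_out, new_dir, prefix="tessdata/"):
    """Copy jar_in to jar_out, replacing tessdata entries with same-named files from new_dir.

    Returns the list of entry names that were replaced.
    """
    new_dir = Path(new_dir)
    replaced = []
    with zipfile.ZipFile(jar_in) as src, \
            zipfile.ZipFile(jar_out, "w", zipfile.ZIP_DEFLATED) as dst:
        for info in src.infolist():
            name = info.filename
            candidate = new_dir / Path(name).name
            is_tessdata_file = (
                name.lstrip("/").startswith(prefix) and not name.endswith("/")
            )
            if is_tessdata_file and candidate.is_file():
                # Keep the entry name, take the bytes from the new training set.
                dst.writestr(name, candidate.read_bytes())
                replaced.append(name)
            else:
                # Everything else is copied through unchanged.
                dst.writestr(info, src.read(name))
    return replaced


# Hypothetical usage; the paths are placeholders:
# swap_tessdata("sikuli-script.jar", "sikuli-script-nor.jar", "my_tessdata")
```

Writing to a separate output jar leaves the original intact, so it is easy to switch back to the English set if Sikuli crashes.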

This happens on Sikuli X.RC2 on both Ubuntu and Windows Vista.

Question information

Language:
English
Status:
Solved
For:
SikuliX
Assignee:
No assignee
Audun Mathias Øygard (amoygard) said :
#1

I can verify that my training set works when I only use English characters, so I assume the problem has to do with non-English characters/UTF-8.
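
One way to narrow this down is to list which entries in the training set's unicharset file fall outside ASCII. The sketch below is only an illustration: the helper `non_ascii_entries` is hypothetical, and it assumes that the first line of the file holds the entry count and that each following line starts with the character itself, followed by whitespace and its properties.

```python
def non_ascii_entries(unicharset_text):
    """Return the unicharset entries that contain at least one non-ASCII character.

    Assumes line 1 is the entry count and every later line begins with the
    character (or character sequence) followed by whitespace.
    """
    entries = []
    for line in unicharset_text.splitlines()[1:]:
        parts = line.split()
        if parts and any(ord(ch) > 127 for ch in parts[0]):
            entries.append(parts[0])
    return entries


# Hypothetical usage; the path is a placeholder:
# from pathlib import Path
# text = Path("tessdata/eng.unicharset").read_text(encoding="utf-8")
# print(non_ascii_entries(text))
```

Removing the reported entries one at a time from a test training set would show whether a specific character, or any multi-byte character at all, triggers the crash.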

Another issue that came up is that Sikuli does not properly handle characters that my training set (correctly) detects as bold. Bold characters are output with an @ symbol in front of them, so a bold a would be "@a", but Sikuli only outputs the @ symbol.

I guess I'll have to remove detection of bold characters from my training set, but it would be nice if it worked.

RaiMan (raimund-hocke) said :
#2

Thanks for the information and evaluation.

I will turn this into a request bug and add it to the OCR summary bug 710586.

RaiMan (raimund-hocke) said :
#3