[1.1.x] Changing OCR language from English to something else

Asked by edlothiad on 2017-11-06

faq 2709 is now revised:
With SikuliX 1.1.x the internally used version of Tesseract is 3.02, hence you have to use the lang data from here:
https://github.com/tesseract-ocr/tesseract/wiki/Data-Files#data-files-for-version-302

Be aware: the returned strings using Region.text() are unicode strings.

------------------------------------------------------------------
After reading FAQ #2709 (found here: https://answers.launchpad.net/sikuli/+faq/2709), I decided to use it to validate the French text in my interface. I followed the steps and downloaded both the tessdata-master and langdata-master folders from the link in the FAQ, which now redirects here: https://github.com/tesseract-ocr/. I have tried placing only fra.traineddata into the SikulixTesseract\tessdata folder, and also placing the entire fra folder from tessdata-master into the tessdata folder. Both times, after changing the language and trying to read the text in a region (which is now French), the Java Runtime Environment simply stops and writes an error message to my SikuliX folder.

The message is as follows:

# A fatal error has been detected by the Java Runtime Environment:
#
# EXCEPTION_ACCESS_VIOLATION (0xc0000005) at pc=0x0000000068b89732, pid=5180, tid=0x0000000000001df4
#
# JRE version: Java(TM) SE Runtime Environment (8.0_131-b11) (build 1.8.0_131-b11)
# Java VM: Java HotSpot(TM) 64-Bit Server VM (25.131-b11 mixed mode windows-amd64 compressed oops)
# Problematic frame:
# C [libtesseract-3.dll+0x189732]
#
# Failed to write core dump. Minidumps are not enabled by default on client versions of Windows
#
# If you would like to submit a bug report, please visit:
# http://bugreport.java.com/bugreport/crash.jsp
# The crash happened outside the Java Virtual Machine in native code.
# See problematic frame for where to report the bug.

Is this error caused by something wrong with the setup, or am I handling the TextRecognizer incorrectly, for example by putting the wrong language data in the wrong folders?

Question information

Language: English
Status: Solved
For: Sikuli
Assignee: No assignee
Solved by: RaiMan
Solved: 2017-11-13
Last query: 2017-11-13
Last reply: 2017-11-07

RaiMan (raimund-hocke) said : #1

With SikuliX 1.1.x the internally used version of Tesseract still is a version 2.x, hence you have to use the lang data from here:
https://github.com/tesseract-ocr/tesseract/wiki/Data-Files#data-files-for-version-20x

I have updated the faq.

thanks for the post.

edlothiad (edlothiad) said : #2

I've downloaded the new folders, which contain the following files:
- fra.DangAmbigs
- fra.freq-dawg
- fra.inttemp
- fra.normproto
- fra.pffmtable
- fra.unicharset
- fra.user-words
- fra.word-dawg

but no fra.traineddata, which the FAQ says to add. So I added the other files instead and it crashed again; however, this time the EXCEPTION_ACCESS_VIOLATION occurred at pc = 0000000068b1533d instead of pc=0x0000000068b89732. Are there still files I'm missing?

edlothiad (edlothiad) said : #3

Ok, I have stopped it from crashing by using the data files from here: https://github.com/tesseract-ocr/tesseract/wiki/Data-Files#data-files-for-version-302.
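
To see at a glance which kind of language data a tessdata folder holds, a small check like the following can help. This is only a sketch in plain Python: the function name describe_tessdata and the LEGACY_SUFFIXES list are made up for illustration (the suffixes are simply the file endings listed in post #2), and nothing here is part of SikuliX or Tesseract.

```python
import os

# Endings of the older-style data files listed in post #2 (illustrative list).
LEGACY_SUFFIXES = ("DangAmbigs", "freq-dawg", "inttemp", "normproto",
                   "pffmtable", "unicharset", "user-words", "word-dawg")


def describe_tessdata(tessdata_dir, lang):
    """Report which data files for `lang` are present in `tessdata_dir`."""
    names = set(os.listdir(tessdata_dir))
    # Tesseract 3.x-style data is a single <lang>.traineddata file.
    has_traineddata = (lang + ".traineddata") in names
    # Older-style data is spread over several <lang>.<suffix> files.
    legacy = sorted(
        name for name in names
        if name.startswith(lang + ".")
        and name.split(".", 1)[1] in LEGACY_SUFFIXES
    )
    return {"traineddata": has_traineddata, "legacy_files": legacy}
```

Call it with the path of your own tessdata folder and a language code such as "fra". A result with "traineddata" set to False and a non-empty "legacy_files" list would suggest the older-style files were installed.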

Although it's stopped crashing I'm unsure if the language is in fact changing, as the accents don't come out as one would expect and the "é" character comes out as "é". Similarly in German, "ö" becomes "ö". Is there any help you can give with regards to this? Thanks.
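
One common cause of garbled accents, offered here as an assumption and not confirmed for this setup, is text that was encoded as UTF-8 but then decoded with a single-byte encoding such as Latin-1. The plain-Python sketch below reproduces that effect; misdecode and repair are hypothetical helper names, not SikuliX functions.

```python
def misdecode(text):
    """Encode text as UTF-8, then (wrongly) decode the bytes as Latin-1."""
    return text.encode("utf-8").decode("latin-1")


def repair(garbled):
    """Undo misdecode(): recover the bytes via Latin-1, then decode as UTF-8."""
    return garbled.encode("latin-1").decode("utf-8")


# u"\u00e9" is the letter e with acute accent; its two UTF-8 bytes
# (0xC3 0xA9) turn into two separate characters when read as Latin-1.
garbled = misdecode(u"\u00e9")
restored = repair(garbled)
```

If the output of Region.text() looks like the value of garbled above (two odd characters per accented letter), the recognition itself may be fine and only the string handling is off.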

RaiMan (raimund-hocke) said : #4

ok, again thanks for not giving up.

My bad: if I had checked instead of guessing, I would have realized that Tesseract 3.02 is the correct choice.

I have corrected the answer and faq 2709 accordingly.

Beginning with Tesseract 3, the text that is read is returned as a unicode string. Staying with SikuliX 1.1.0 therefore causes problems, since the Jython 2.5 it contains does not recognize unicode strings.

I recommend upgrading to SikuliX 1.1.1, which has Jython 2.7 (unicode aware).
Additionally, using Java 7 or even Java 8 (not Java 9 yet!) would be a good choice.

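I ran a test with German like this:
import org.sikuli.script.TextRecognizer as TR
Settings.OcrReadText = True # switch on text reading for Region.text()
Settings.OcrLanguage = "deu" # language code, matching deu.traineddata
TR.reset() # reset the recognizer after changing the settings

text = selectRegion().text() # select a region interactively and read its text
uprint(text) # use uprint(): the normal print is not unicode aware
popup(text) # popup() is unicode aware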

which worked as expected and printed the German umlauts ä, ü, ö.

uprint() is a SikuliX helper function that internally makes unicode strings printable and can be used like the print statement.
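
The general idea behind making a unicode string printable can be sketched in plain Python as follows. This is not the actual implementation of uprint(); the function name to_console_bytes is invented for illustration, and falling back to UTF-8 when the console encoding is unknown is an assumption of this sketch.

```python
import sys


def to_console_bytes(text, encoding=None):
    """Encode a unicode string explicitly so it can be written to a console.

    Characters the target encoding cannot represent are replaced with "?"
    instead of raising UnicodeEncodeError.
    """
    if encoding is None:
        # Assumption: fall back to UTF-8 if the console encoding is unknown.
        encoding = getattr(sys.stdout, "encoding", None) or "utf-8"
    return text.encode(encoding, "replace")


umlauts = u"\u00e4\u00fc\u00f6"  # the German umlauts a, u, o with diaeresis
as_utf8 = to_console_bytes(umlauts, "utf-8")
as_ascii = to_console_bytes(umlauts, "ascii")
```

With an ASCII-only target the umlauts degrade to question marks, which is one way a helper can stay "printable" while still losing the characters on a console that cannot show them.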

edlothiad (edlothiad) said : #5

OK, I guess that's where the issue is. As far as I was aware, I've been using SikuliX 1.1.1, with the setup you mentioned here: https://answers.launchpad.net/sikuli/+question/657706

I have sikulixsetup-1.1.1.jar and the SikuliX-1.1.1-SetupLog in my SikuliX folder, and jython-standalone-2.7.0 in my downloads folder (inside my SikuliX folder).

Yet the IDE shows SikulixIDE 1.1.0 in the title bar, and both uprint and popup cause the same unicode error. Should I just run the setup you linked in the answer above again?

Best RaiMan (raimund-hocke) said : #6

Yes, each SikuliX version needs its own setup run, to get the valid jars for that version into the folder where you run the setup.

Take care that, when running version 1.1.1, nothing in your environment points to anything related to version 1.1.0.
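
One way to look for leftovers of an older version is to scan the environment variables for a version string. The sketch below is plain Python and purely illustrative: find_references is a hypothetical helper, and the paths in the sample dictionary are invented. It only covers environment variables, not shortcuts or other places an old installation might be referenced.

```python
import os


def find_references(needle, env=None):
    """Return the names of environment variables whose value contains `needle`."""
    env = os.environ if env is None else env
    return sorted(name for name, value in env.items() if needle in value)


# Invented sample environment, for illustration only.
sample_env = {
    "CLASSPATH": "C:/SikuliX-1.1.0/sikulixapi.jar",
    "PATH": "C:/bin",
}
stale = find_references("1.1.0", sample_env)
```

Calling find_references("1.1.0") without a second argument scans the real environment of the current process.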

edlothiad (edlothiad) said : #7

Thanks RaiMan, that solved my question.

edlothiad (edlothiad) said : #8

However, if I understood correctly, the uprint function should be unicode aware, yet it fails to print the unicode characters (popup now prints umlauts correctly). Is this something that I'm doing wrong?