[1.1.1+] Tesseract (Region.text()): how to switch to a different language pack

Created by RaiMan on 2015-03-25
Keywords:
Last updated by:
RaiMan on 2017-11-08

Valid for version 1.1.1+ (not version 2)

--- background knowledge:
----------------------------------------

The SikuliX local application data folder
SikuliX stores some information that is needed at runtime in a system-dependent folder under the user's home directory:
Windows: in a folder Sikulix inside the folder that the environment variable %APPDATA% points to
Mac: in ~/Library/Application Support/Sikulix
Linux: in ~/.Sikulix
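
As a rough illustration, the three locations above can be resolved with a few lines of plain Python. This is only a sketch: the function name sikulix_appdata_dir and its parameters are made up for this example and are not part of SikuliX. It assumes standard Python, where sys.platform names the operating system; inside the SikuliX IDE (Jython) sys.platform starts with "java", so the platform would have to be passed in explicitly there.

```python
import os
import sys


def sikulix_appdata_dir(platform=None, env=None, home=None):
    """Return the SikuliX local application data folder for a platform.

    Hypothetical helper for illustration only. The arguments default to
    the values of the running system and can be overridden for testing.
    """
    platform = sys.platform if platform is None else platform
    env = os.environ if env is None else env
    home = os.path.expanduser("~") if home is None else home
    if platform.startswith("win"):
        # Windows: folder Sikulix inside the folder %APPDATA% points to
        return os.path.join(env["APPDATA"], "Sikulix")
    if platform == "darwin":
        # Mac: ~/Library/Application Support/Sikulix
        return os.path.join(home, "Library", "Application Support", "Sikulix")
    # Linux (and anything else in this sketch): ~/.Sikulix
    return os.path.join(home, ".Sikulix")
```

Calling sikulix_appdata_dir() without arguments returns the folder for the machine the code runs on.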

Currently you might find the following subfolders there:
Extensions* - intended to contain extension and plugin artefacts (currently empty)
Lib* - the stuff needed to support scripting with Jython and JRuby
SikulixDownloads - non-SikuliX downloads needed for setup (Jython, JRuby, Tesseract-tessdata, ...)
SikulixDownloads_201503220100 - SikuliX downloads (suffix is the timestamp of the active SikuliX version)
SikulixLibs_201503220100 - SikuliX native libraries for this system (suffix is the timestamp of the active SikuliX version)
SikulixStore* - contains other files that are loaded or produced at runtime (debug, options, last script from net, ...)
SikulixTesseract* - files to support the usage of Tesseract (currently tessdata)

--- get the Tesseract language pack
----------------------------------------------------

--1. go to the page:
https://github.com/tesseract-ocr/tesseract/wiki/Data-Files#data-files-for-version-302

--2. select the language you want to use and download the pack.
what you download are ....tar.gz files, which have to be unpacked with a tool like 7-Zip or The Unarchiver.
after unpacking you get this folder structure (-- are folders):
-- tesseract-ocr
    -- tessdata
       some files
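
If no unpacking tool is at hand, the archive can also be unpacked with Python's standard tarfile module. The following is a minimal sketch, not part of SikuliX; the function name unpack_language_pack is made up for this example, and it is meant to be run with a normal Python installation on an archive you downloaded yourself.

```python
import tarfile


def unpack_language_pack(archive_path, target_dir):
    """Unpack a downloaded language pack (.tar.gz or plain .tar).

    Hypothetical helper for illustration only. After unpacking,
    target_dir contains the folder tesseract-ocr with tessdata inside.
    """
    # mode "r:*" lets tarfile detect whether the archive is gzipped
    with tarfile.open(archive_path, "r:*") as archive:
        archive.extractall(target_dir)
    return target_dir
```

For example, unpack_language_pack("tesseract-ocr-3.02.chi_sim.tar.gz", "unpacked") would create unpacked/tesseract-ocr/tessdata with the language files in it.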

--3. copy the language files to the SikuliX area
copy all files from the downloaded tessdata folder to the tessdata folder inside the subfolder SikulixTesseract of the SikuliX local application data folder mentioned above (this is where you already have eng.traineddata).
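
The copy step can of course be done by hand in a file manager. If you prefer to script it, a sketch in plain Python could look like this. The function name install_language_files is made up for this example; both folder paths have to be supplied by you, and the target folder is assumed to exist already.

```python
import os
import shutil


def install_language_files(unpacked_tessdata, sikulix_tessdata):
    """Copy all files from the unpacked tessdata folder to SikuliX's tessdata.

    Hypothetical helper for illustration only. Subfolders are skipped;
    files with the same name in the target folder are overwritten.
    Returns the sorted list of copied file names.
    """
    copied = []
    for name in sorted(os.listdir(unpacked_tessdata)):
        source = os.path.join(unpacked_tessdata, name)
        if os.path.isfile(source):
            shutil.copy2(source, os.path.join(sikulix_tessdata, name))
            copied.append(name)
    return copied
```

The returned list makes it easy to check which language files actually arrived in the SikuliX folder.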

--4. switch on the Ocr feature and select the language to use for recognition

in script:
import org.sikuli.script.TextRecognizer as TR
Settings.OcrReadText = True
Settings.OcrLanguage = "language"
TR.reset()

in Java:
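import org.sikuli.basics.Settings;
import org.sikuli.script.TextRecognizer;

Settings.OcrReadText = true; // switch on text recognition
Settings.OcrLanguage = "language"; // name of the language pack to use
TextRecognizer.reset(); // reset so that the new settings take effect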

where language is the part of the file name language.traineddata before the ending .traineddata (see example below).

now using the Region.text() feature should return reasonable results.
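
To find out which values can be used for Settings.OcrLanguage, it is enough to look at the file names in the tessdata folder. A small sketch in plain Python, assuming you pass in the path of that folder yourself (the function name available_languages is made up for this example and is not a SikuliX feature):

```python
import os


def available_languages(tessdata_dir):
    """List the language names found in a tessdata folder.

    Hypothetical helper for illustration only. A language name is the
    file name without the ending .traineddata, e.g. eng for
    eng.traineddata.
    """
    suffix = ".traineddata"
    return sorted(
        name[:-len(suffix)]
        for name in os.listdir(tessdata_dir)
        if name.endswith(suffix)
    )
```

Each entry of the returned list is a candidate value for Settings.OcrLanguage.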

-- example:
-----------------

- on the download page you select the entry with the title
Chinese (Simplified) language data for Tesseract 3.02

- and get the following file on your machine:
tesseract-ocr-3.02.chi_sim.tar.gz (or only .tar if your system auto-unzips)

- after unpacking you get
-- tesseract-ocr
    -- tessdata
        chi_sim.traineddata

- copy chi_sim.traineddata to the subfolder tessdata in SikulixTesseract in the SikuliX local application data folder

- use in script
import org.sikuli.script.TextRecognizer as TR
Settings.OcrReadText = True
Settings.OcrLanguage = "chi_sim"
TR.reset()

- to switch to another language later in the same script, just do this:
Settings.OcrLanguage = "eng"
TR.reset()