OCR class features sikuli IDE

Asked by Eric G on 2020-02-28

Hi, I currently have something like:

while 1:
    myValue = int(region.text())
    if myValue > 60:
        pass  # do something
    # do something else

The values Tesseract returns are mostly accurate, but there is always a special character or letter attached that messes everything up. I would like Tesseract to ignore everything except the digits 0-9. I've seen dozens of similar threads, but they usually amount to "use a bigger font" or "read the docs". I have read the Tesseract documentation and the SikuliX docs on working with text and OCR features.

I know in Python it might look something like:

custom_config = r'--oem 3 --psm 6 outputbase digits'
print(pytesseract.image_to_string(img, config=custom_config))

custom_config = r'-c tessedit_char_whitelist=0123456789 --psm 6'
print(pytesseract.image_to_string(img, config=custom_config))

But how do I implement this in the SikuliX IDE 2.0.3?

Eric G (gamemaster181) said : #2

I've read both the text and OCR features docs for Sikuli and the docs for Tesseract, but I'm still at a loss on how to use them in the IDE.
What would it look like if I wanted oem(0), psm(8) and the config "digits"? May I please get an example?

What I have right now:

Settings.OcrLanguage = "eng"
Settings.Configs = "digits"
#rest of code

The digits config file contains "tessedit_char_whitelist 0123456789".

It runs, but when I test int(myValue.text()) it still returns non-integers. What am I doing wrong?

RaiMan (raimund-hocke) said : #3

SikuliX 2.0.x comes with an eng.traineddata that can only be used with oem(3) (the standard).
If you try to use oem(0) in this state, your JVM will crash.

So you first have to provide an eng.traineddata that is suitable for oem(0):
download it from https://github.com/tesseract-ocr/tessdata
and put it into the SikuliX app-data folder .../tessdata

If you later decide to switch back to the eng.traineddata provided by SikuliX (smaller, faster), simply move your eng.traineddata out of that folder (the bundled one will be exported automatically the next time).

# define the region containing the word of digits
reg = Region(107,168,49,14)

# tell OCR to use oem 0

# tell OCR to globally use the digits config (comes with SikuliX)

# show the OCR options status

# read from the region as word (psm 8)
print OCR.readWord(reg)
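
The comments above describe three steps whose statements are not shown. A minimal sketch of what such statements might look like, assuming the SikuliX 2.0.x calls OCR.globalOptions() and OCR.status() behave as described in the SikuliX text and OCR documentation (neither call appears in this thread; only oem(OCR.OEM.TESSERACT_ONLY), configs("digits") and OCR.readWord do). The helper name use_digits_globally is hypothetical, and the OCR class is passed in as a parameter only so the function can also be tried outside the IDE with a stand-in object:

```python
def use_digits_globally(ocr):
    """Point the global OCR options at the legacy engine plus the digits config.

    `ocr` is expected to be SikuliX's OCR class. Assumption: it provides
    globalOptions(), OEM.TESSERACT_ONLY and status().
    """
    options = ocr.globalOptions()
    # oem 0: legacy Tesseract engine (needs a matching eng.traineddata)
    options.oem(ocr.OEM.TESSERACT_ONLY)
    # use the "digits" config that comes with SikuliX
    options.configs("digits")
    # show the resulting global option state
    ocr.status()
```

In the IDE this would be called as use_digits_globally(OCR) before print OCR.readWord(reg).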

Best RaiMan (raimund-hocke) said : #4

The above solution changes the global options for the complete IDE session (or script-run if run from command-line).

Use OCR.reset() to get back to the standard/default options.
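
Because changed global options stay in effect until they are reset, one way to keep the change contained is to reset in a finally block. This is only a sketch: it relies on OCR.readWord(reg) and OCR.reset() as they appear in this thread, while the helper name read_word_then_reset and the configure callback are hypothetical. The OCR class is passed in as a parameter so the helper can be exercised with a stand-in object:

```python
def read_word_then_reset(ocr, region, configure):
    """Apply temporary global OCR options, read one word, then restore defaults.

    `configure` is a caller-supplied function that receives `ocr` and
    changes its global options (hypothetical hook, not a SikuliX feature).
    """
    try:
        configure(ocr)
        return ocr.readWord(region)
    finally:
        # always return to the SikuliX defaults, even if reading fails
        ocr.reset()
```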

A variant is to use your own options set:

reg = Region(107,168,49,14)
myOptions = OCR.Options().oem(OCR.OEM.TESSERACT_ONLY).configs("digits")
print OCR.readWord(reg, myOptions)
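
Even with the digits config, the int(...) call in the loop from the question raises ValueError when the OCR result is empty or contains anything unexpected. A small sketch of a defensive conversion in plain Python; to_int is a hypothetical helper, not part of SikuliX:

```python
import re

def to_int(text):
    """Return the digits found in OCR text as an int, or None if there are none."""
    digits = re.sub(r"[^0-9]", "", text or "")
    return int(digits) if digits else None
```

It could be used as myValue = to_int(OCR.readWord(reg, myOptions)), followed by a check such as "if myValue is not None and myValue > 60:". Note that dropping characters can silently change the value (a misread "1o5" becomes 15), so this is a fallback for stray characters rather than a replacement for the OCR options above.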

Eric G (gamemaster181) said : #5

Thanks RaiMan, that solved my question.

Eric G (gamemaster181) said : #6

Super insightful, solved my problem. Thanks as always, you're awesome!