OCR class features sikuli IDE

Asked by Eric G

Hi, I currently have something like:

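while True:
    try:
        # OCR the region and convert the result to an integer
        myValue = int(region.text())
        if myValue > 60:
            pass  # do something
    except ValueError:
        # int() failed because the OCR text contained non-digit characters
        pass  # do something else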

The values Tesseract returns are mostly accurate, but there is always a special character or letter attached that messes everything up. I would like Tesseract to ignore everything except the digits 0-9. I've seen dozens of similar threads, but they usually amount to "use a bigger font" or "read the docs". I've read the Tesseract documentation and the docs on working with text and OCR features.

I know in python it might look something like:

custom_config = r'--oem 3 --psm 6 outputbase digits'
print(pytesseract.image_to_string(img, config=custom_config))

or
custom_config = r'-c tessedit_char_whitelist=0123456789 --psm 6'
print(pytesseract.image_to_string(img, config=custom_config))

But how do I implement this in the SikuliX IDE 2.0.3?

Question information

Language: English
Status: Solved
For: SikuliX
Solved by: RaiMan
RaiMan (raimund-hocke) said :
#1
Eric G (gamemaster181) said :
#2

I've read both the text and OCR features docs for SikuliX and the Tesseract documentation, but I'm still at a loss on how to use this in the IDE.
What would it look like if I wanted oem(0), psm(8) and the config "digits"? May I please get an example?

What I have right now:

Settings.OcrLanguage = "eng"
Settings.Configs = "digits"
OCR.PSM.SINGLE_WORD
OCR.OEM.TESSERACT_ONLY
#rest of code

the digits config file is "tessedit_char_whitelist 0123456789"

It runs, but when I test int(myValue.text()) it still returns non-integers. What am I doing wrong?

RaiMan (raimund-hocke) said :
#3

SikuliX 2.0.x comes with an eng.traineddata that can only be used with oem(3) (the standard).
If you try to use oem(0) in this state, your JVM will crash.

So you first have to provide an eng.traineddata that is suitable for oem(0):
download it from https://github.com/tesseract-ocr/tessdata
and put it into the SikuliX app-data folder .../tessdata

If you later decide to switch back to the eng.traineddata provided by SikuliX (smaller, faster), simply move your eng.traineddata out of that folder (the bundled one will be exported automatically the next time).

#define the region containing the word of digits
reg = Region(107,168,49,14)

# tell OCR to use oem 0
OCR.globalOptions().oem(OCR.OEM.TESSERACT_ONLY)

# tell OCR to globally use the digits config (comes with SikuliX)
OCR.globalOptions().configs("digits")

# show the OCR options status
OCR.status()

# read from the region as word (psm 8)
print OCR.readWord(reg)

Best RaiMan (raimund-hocke) said :
#4

The above solution changes the global options for the complete IDE session (or for the script run, if run from the command line).

Use OCR.reset() to get back to the standard defaults.

A variant is to use your own options set:

reg = Region(107,168,49,14)
myOptions = OCR.Options().oem(OCR.OEM.TESSERACT_ONLY).configs("digits")
print OCR.readWord(reg, myOptions)
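
If a stray character still slips into the result now and then, a defensive parse on the script side can keep the int() conversion from failing. The following is only a minimal sketch in plain Python, not part of the SikuliX API: parse_ocr_int and check_once are hypothetical helper names, and the threshold of 60 is taken from the original question. It assumes that the first run of digits in the OCR text is the number you want.

```python
import re


def parse_ocr_int(text):
    """Return the first run of digits in text as an int, or None if there is none."""
    match = re.search(r"\d+", text or "")
    return int(match.group()) if match else None


def check_once(read_text, threshold=60):
    """Call read_text() once and compare the parsed number with threshold.

    Returns True or False for a readable number, and None when the
    OCR text contained no digits at all.
    """
    value = parse_ocr_int(read_text())
    if value is None:
        return None
    return value > threshold
```

In a SikuliX script, read_text could be something like lambda: OCR.readWord(reg, myOptions), using the reg and myOptions defined above. Taking only the first digit run means that a result split by a space (for example "6 1") is read as 6, so this is a safety net on top of the digits config rather than a replacement for it.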

Eric G (gamemaster181) said :
#5

Thanks RaiMan, that solved my question.

Eric G (gamemaster181) said :
#6

Super insightful, solved my problem. Thanks as always, you're awesome!