[HowTo] use external Tesseract install for OCR (version 3+) --- workaround

Asked by Andrew Gr on 2017-05-17

********* Workaround for everyone who is dissatisfied with the quality and handling of the built-in Tesseract OCR support, which is based on version 2 features (thanks to Andrew Grabov)
----------------------------------------------------------------------
So I was able to connect to the external Tesseract, and I would say that it works absolutely fine!

I haven't noticed any speed issues in SikuliX script processing, maybe because the laptop that runs the script is pretty powerful and has an SSD drive, so the file operations have no noticeable effect on performance.

As for OCR quality: it is noticeably better. I have compiled both versions (3.05 and 4), and both work fine. The good thing is that once you have a separate installation of Tesseract, you have full control over it.

And here are the code snippets responsible for extracting the text from the image:

##############################################

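# Paths to the external Tesseract installation (adjust to your setup)
TESSERACT_EXEC = "\"C:\\Program Files\\tesseract3\\tesseractmain.exe\""
TESSERACT_TESSDATA = "\"C:\\Program Files\\tesseract3\\tessdata\""

def getText(region):
    # capture the region and get the path of the stored image file
    pathToImg = Screen().capture(region).getFilename()
    # quote the path in case it contains spaces
    quoted = "\"" + pathToImg + "\""
    # the image path doubles as the output base: Tesseract writes <pathToImg>.txt
    run(TESSERACT_EXEC + " " + quoted + " " + quoted + " --tessdata-dir " + TESSERACT_TESSDATA)
    return readFile(pathToImg + ".txt")

def readFile(pathToFile):
    with open(pathToFile, 'r') as f:
        # line breaks are removed without a replacement, so multi-line text is
        # joined without spaces; return f.read() alone to keep the line breaks
        return f.read().replace('\n', '')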

##############################################

Works like a charm! ;)
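
A small note on readFile above: it removes line breaks without putting anything in their place, so the last word of one line and the first word of the next end up glued together. If that is not wanted, something like the following might help. This is only a sketch; join_ocr_lines is a hypothetical helper name and not part of SikuliX or Tesseract.

```python
def join_ocr_lines(text):
    """Collapse line breaks and runs of whitespace into single spaces."""
    # str.split() without arguments splits on any whitespace (spaces, tabs,
    # line breaks) and drops empty pieces, so joining with " " leaves exactly
    # one space between words and none at either end.
    return " ".join(text.split())
```

In readFile it could replace the last line, i.e. return join_ocr_lines(f.read()).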

--------------------------------------------------------------------------------------------------------------------

Hi everyone,

Could you please help me figure out why Tesseract ignores whitespace?
The text itself is recognised fine, but the whitespace between two words is missing.
I have already increased the space between the words, but it seems to be set to ignore them.

Can you suggest which Tesseract parameters I should try to adjust?
I have tried several from here: http://www.sk-spell.sk.cx/tesseract-ocr-parameters-in-302-version
but still no success...

Thank you in advance!

Question information

Language: English
Status: Solved
For: Sikuli
Assignee: No assignee
Solved by: RaiMan
Solved: 2017-05-17
Last query: 2017-05-17
Last reply: 2017-05-17

Andrew Gr (andrew-gr) said : #1

P.S. I am on the latest available version of SikuliX - v1.1.1

RaiMan (raimund-hocke) said : #2

No chance with the current implementation in SikuliX version 1.x to improve anything in that direction. Sorry.

If your usage of the OCR feature is not time critical (meaning speed), then you might think about storing the images to files and running the tesseract command (you need a separate Tesseract 3 installation) in a subprocess.

Andrew Gr (andrew-gr) said : #3

Hi RaiMan!

Thank you for the quick response!
Can you please point me to any code samples that use an external OCR process?
Or, to rephrase my question: what would be the most efficient way to integrate with an external OCR system?

Thank you in advance for your feedback!

RaiMan (raimund-hocke) said : #4 (best answer)

First you have to set up a working installation of Tesseract 3.0.
Then you should play around a little with the tesseract command on the command line to find out which options give better results.

Then you have to capture the respective images in your SikuliX workflow, store them to files and issue the tesseract command using the respective SikuliX feature. Finally, read the result file containing the recognised text.
I do not have any code samples for that.

Another option is to use the package Tess4J (I will use it with SikuliX version 2 later this year).
The package I will use: https://github.com/RaiMan/Sikulix2tesseract
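
For reference, the file-based steps described above (store the image, run the tesseract command, read the result file) could look roughly like this in plain Python using the standard subprocess module instead of a SikuliX feature. This is only a sketch under assumptions: build_tesseract_command and ocr_image_file are hypothetical names, the executable and tessdata paths are whatever your own installation uses, and extra_args is just a hook for the options you found useful on the command line.

```python
import os
import subprocess

def build_tesseract_command(exe, image_path, output_base,
                            tessdata_dir=None, extra_args=None):
    """Return the argument list for: tesseract <image> <outputbase> [options]."""
    cmd = [exe, image_path, output_base]
    if tessdata_dir:
        cmd += ["--tessdata-dir", tessdata_dir]
    if extra_args:
        cmd += list(extra_args)
    return cmd

def ocr_image_file(exe, image_path, tessdata_dir=None, extra_args=None):
    """Run Tesseract on an already stored image file and return the text."""
    # Tesseract appends ".txt" to the output base on its own.
    output_base = os.path.splitext(image_path)[0]
    cmd = build_tesseract_command(exe, image_path, output_base,
                                  tessdata_dir, extra_args)
    # Passing a list avoids manual quoting of paths that contain spaces;
    # check_call raises CalledProcessError if tesseract exits with an error.
    subprocess.check_call(cmd)
    with open(output_base + ".txt") as f:
        return f.read()
```

The image file itself would still come from the SikuliX capture step; only the command invocation and the reading of the result file are covered here.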

Andrew Gr (andrew-gr) said : #5

Okay, thank you, I will try it and post any findings here!

Andrew Gr (andrew-gr) said : #6

Thanks RaiMan, that solved my question.

Andrew Gr (andrew-gr) said : #7

Just an update on this topic:

I was able to connect to the external Tesseract, and I would say that it works absolutely fine!

I haven't noticed any speed issues in SikuliX script processing, maybe because the laptop that runs the script is pretty powerful and has an SSD drive, so the file operations have no noticeable effect on performance.

As for OCR quality: it is noticeably better. I have compiled both versions (3.05 and 4), and both work fine. The good thing is that once you have a separate installation of Tesseract, you have full control over it.

And here are the code snippets responsible for extracting the text from the image:

##############################################

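# Paths to the external Tesseract installation (adjust to your setup)
TESSERACT_EXEC = "\"C:\\Program Files\\tesseract3\\tesseractmain.exe\""
TESSERACT_TESSDATA = "\"C:\\Program Files\\tesseract3\\tessdata\""

def getText(region):
    # capture the region and get the path of the stored image file
    pathToImg = Screen().capture(region).getFilename()
    # quote the path in case it contains spaces
    quoted = "\"" + pathToImg + "\""
    # the image path doubles as the output base: Tesseract writes <pathToImg>.txt
    run(TESSERACT_EXEC + " " + quoted + " " + quoted + " --tessdata-dir " + TESSERACT_TESSDATA)
    return readFile(pathToImg + ".txt")

def readFile(pathToFile):
    with open(pathToFile, 'r') as f:
        # line breaks are removed without a replacement, so multi-line text is
        # joined without spaces; return f.read() alone to keep the line breaks
        return f.read().replace('\n', '')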

##############################################

Works like a charm! ;)

RaiMan (raimund-hocke) said : #8

Thanks for the feedback.

I put your comment at the top as a possible workaround for SikuliX version 1.1.x.