TesseractOCR ignoring white-spaces?

Asked by Andrew Grabov on 2017-05-17

Hi everyone,

could you please help me figure out why Tesseract ignores whitespace?
The text itself is recognised fine, but the whitespace between two words is missing.
I have already increased the space between words, but it seems to be set to simply ignore them.

Can you suggest which Tesseract parameters I should try adjusting?
I have tried several from here: http://www.sk-spell.sk.cx/tesseract-ocr-parameters-in-302-version
but still no success...

Thank you in advance!

Question information

Language:
English
Status:
Solved
For:
Sikuli
Assignee:
No assignee
Solved by:
RaiMan
Solved:
2017-05-17
Last query:
2017-05-17
Last reply:
2017-05-17
Andrew Grabov (andrew-gr) said : #1

P.S. I am on the latest available version of SikuliX - v1.1.1

RaiMan (raimund-hocke) said : #2

No chance with the current implementation in SikuliX version 1.x to improve anything in that direction. Sorry.

If your use of the OCR feature is not time-critical (meaning speed does not matter), then you might think about storing the images to files and running the tesseract command (you need a separate Tesseract 3 installation) in a subprocess.
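
A minimal sketch of the subprocess idea, not taken from the thread: it assumes a `tesseract` executable is on the PATH and relies on the command-line form `tesseract IMAGE OUTPUTBASE [options]`, which writes the recognised text to `OUTPUTBASE.txt`. The function names and the `extra_args` parameter are hypothetical helpers chosen for illustration.

```python
import subprocess


def build_tesseract_command(image_path, output_base, extra_args=None,
                            executable="tesseract"):
    """Build the argument list: tesseract IMAGE OUTPUTBASE [options]."""
    command = [executable, image_path, output_base]
    if extra_args:
        # Options found by experimenting on the command line go here.
        command.extend(extra_args)
    return command


def run_tesseract(image_path, output_base, extra_args=None,
                  executable="tesseract"):
    """Run tesseract in a subprocess and return the path of the result file.

    Assumes tesseract is installed separately; raises CalledProcessError
    if the command exits with a non-zero status.
    """
    command = build_tesseract_command(image_path, output_base,
                                      extra_args, executable)
    subprocess.check_call(command)
    # Tesseract appends ".txt" to the output base name.
    return output_base + ".txt"
```

A call such as `run_tesseract("word.png", "word", ["-psm", "7"])` would, under these assumptions, ask a Tesseract 3 installation to treat the image as a single text line. Whether any particular option restores the missing spaces has to be checked on the command line first.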

Andrew Grabov (andrew-gr) said : #3

Hi, RaiMan!

Thank you for the quick response!
Can you please point me to any code samples showing how to use an external OCR process?
Or, to rephrase my question: what would be the most efficient way to integrate with an external OCR system?

Thank you in advance for your feedback!

Best RaiMan (raimund-hocke) said : #4

First you have to set up a working installation of Tesseract 3.0.
Then you should play around a little with the tesseract command on the command line and find out possible ways to get better results.

Then you have to capture the respective images in your SikuliX workflow, store them to files and issue the tesseract command using the respective SikuliX feature. Then read the result file containing the evaluated text.
I do not have any code samples for that.
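
For readers looking for a starting point, the steps above (store the image, run tesseract, read the result file) could be sketched roughly as follows. This is not from the original answer: the function name `ocr_image_file`, its `extra_args` parameter, and the injectable `runner` argument are hypothetical, and the sketch assumes a `tesseract` executable on the PATH that writes its result to `OUTPUTBASE.txt`.

```python
import io
import os
import shutil
import subprocess
import tempfile


def ocr_image_file(image_path, extra_args=None, runner=subprocess.check_call):
    """Run tesseract on a stored image and return the recognised text.

    `runner` executes the command list; it defaults to subprocess.check_call
    and can be swapped out, for example when no tesseract is installed.
    """
    work_dir = tempfile.mkdtemp(prefix="ocr_")
    try:
        output_base = os.path.join(work_dir, "result")
        command = ["tesseract", image_path, output_base] + list(extra_args or [])
        runner(command)
        # Tesseract writes the text to <output_base>.txt.
        with io.open(output_base + ".txt", encoding="utf-8") as handle:
            # strip() only trims the ends; spaces between words are kept.
            return handle.read().strip()
    finally:
        # Remove the temporary result directory in every case.
        shutil.rmtree(work_dir, ignore_errors=True)
```

In a SikuliX 1.x script the image path would come from the capture step; as far as I know, `capture(someRegion)` returns the path of the saved image file, but treat that as an assumption to verify against your SikuliX version.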

Another option is to use the package Tess4J (I will use it with SikuliX version 2 later this year).
The package I will use: https://github.com/RaiMan/Sikulix2tesseract

Andrew Grabov (andrew-gr) said : #5

Okay, thank you, I will try it and post any findings here!

Andrew Grabov (andrew-gr) said : #6

Thanks RaiMan, that solved my question.