Documentation for Sikulix/Tesseract usage?

Asked by Geoff Bache

I've been using the OCR features with a certain amount of success, but inevitably it doesn't work all of the time. In particular the OCR seems to be much worse as soon as it encounters white text on a dark background. Internet opinion seems divided as to whether Tesseract is just like this, or whether it shouldn't be a problem...

I've also been noticing that Region.text() sometimes finds text that find(text) doesn't.

In any case, I'd like to try to debug this and understand it, and potentially ask the Tesseract people for tips. Is there some kind of documentation or guide to how Sikulix uses Tesseract, for example comparable to the explanation for how it uses OpenCV? Is there an equivalent to find(text) in Tesseract, or is that algorithm implemented entirely in Sikuli?

Question information

Language: English
Status: Solved
For: SikuliX
Assignee: No assignee
Solved by: Geoff Bache
RaiMan (raimund-hocke) said :
#1

I am sorry to say: the implementation of the Tesseract C++ API in Sikuli is a mess.
It has been this way since the former developers released X1.0-RC3.
The features used internally are at the level of Tesseract 2, with only some minor fixes that allow Tesseract 3 to be used.
But these fixes apply neither to the text() feature (OCR) nor to the findText() feature (currently find(some text)).
So the quality of OCR and of text search is still at the initial (poor) level.

A light font on a dark background is a problem for Tesseract, and the recommendation is to convert the image to dark text on a light background before passing it to Tesseract. The only steps SikuliX currently takes in line with the Tesseract OCR recommendations are converting the image to greyscale and rescaling it to approximate the recommended 300 dpi.
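The preprocessing described above (invert light-on-dark images, then upscale) can be sketched in a few lines. This is only an illustration, not SikuliX's implementation: the function names, the brightness threshold of 128 and the scale factor of 3 (which assumes a screen capture of roughly 96 dpi) are all assumptions, and the image is modelled as a plain list of rows of greyscale values (0 = black, 255 = white) so the sketch needs no image library.

```python
def mean_brightness(pixels):
    """Average greyscale value over all pixels (0 = black, 255 = white)."""
    total = sum(sum(row) for row in pixels)
    count = sum(len(row) for row in pixels)
    return total / float(count)


def invert(pixels):
    """Swap light and dark: light text on dark becomes dark text on light."""
    return [[255 - value for value in row] for row in pixels]


def upscale(pixels, factor):
    """Nearest-neighbour upscaling: repeat every pixel 'factor' times in x and y."""
    scaled = []
    for row in pixels:
        wide = [value for value in row for _ in range(factor)]
        scaled.extend(list(wide) for _ in range(factor))
    return scaled


def prepare_for_ocr(pixels, factor=3, dark_threshold=128):
    """Hypothetical helper: invert mostly-dark images, then upscale.

    The threshold and factor are illustrative guesses, not values
    taken from SikuliX or Tesseract.
    """
    if mean_brightness(pixels) < dark_threshold:
        pixels = invert(pixels)
    return upscale(pixels, factor)
```

In a real script the same two steps would normally be done with an image library on the captured region before handing the result to the OCR engine; a smoother interpolation than nearest-neighbour may also give better results.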

I will only touch this area later this year with version 2, and will definitely use Tess4J for OCR (Region.text()).
I have not yet decided how I will tackle the findText() feature, but if possible it will also use Tess4J.

So if you want to know how SikuliX uses it:
You simply have to understand the code, which is mostly C++ and almost undocumented.

What I added recently in version 1.1.0 (faq 2709) is the possibility to switch to a different language pack, which in theory also allows the use of your own traineddata.
Additionally, Tesseract allows different option files in the tessdata folder for different goals. I have not yet tested whether the current approach of SikuliX lets Tesseract recognise these options.

... and yes, text() and findText() are completely different implementations, so if one of them finds some text, it does not mean that the other will as well, and vice versa.

If you go into the code and find something to improve: always welcome.

Just fork the github repo and send pull requests.

Geoff Bache (geoff.bache) said :
#2

OK, thanks.

There seem to be many ways to configure Tesseract. For example, I would dearly love to restrict the characters it recognises so it doesn't keep thinking it's found some weird letter I've never seen before :)

http://stackoverflow.com/questions/2363490/limit-characters-tesseract-is-looking-for

Is there a way to do this from Sikuli, or indeed to do any other Tesseract configuration via Sikuli? You mention "different optionfiles in the tessdata folder", which sounds promising...

RaiMan (raimund-hocke) said :
#3

Yes, this is exactly the option system of Tesseract.

... but that is how you tell Tesseract which option files to use when running the standalone Tesseract command from the command line.
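For the standalone command-line case, a minimal sketch of such an option file and the matching call might look like this. The variable tessedit_char_whitelist is the Tesseract setting discussed in the Stack Overflow link above; the helper names and the file name mychars are hypothetical, and the sketch assumes the option file has been saved under tessdata/configs, where Tesseract looks for config files named on the command line.

```python
def whitelist_config(allowed_chars):
    """Content of a Tesseract option file restricting recognised characters.

    Option files hold one 'variable value' pair per line.
    """
    return "tessedit_char_whitelist %s\n" % allowed_chars


def tesseract_command(image_path, output_base, config_name, lang="eng"):
    """Argument list for the standalone tesseract command.

    config_name is the name of an option file (hypothetical here),
    assumed to be stored in tessdata/configs.
    """
    return ["tesseract", image_path, output_base, "-l", lang, config_name]


# Example: restrict recognition to digits only.
config_text = whitelist_config("0123456789")
command = tesseract_command("screenshot.png", "result", "mychars")
```

The returned list can be passed to subprocess.call() once the option file is in place. None of this reaches Tesseract when it is used through the SikuliX API, which is the limitation described in this reply.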

When using the API as it is done inside SikuliX, you have to set some initialisation parameters for the TessBaseAPI at initialisation/startup.

I have already implemented this for choosing a different language, as mentioned in comment #1.
Deep down in the C++ code there is already a function that allows any init variable to be set for Tesseract at initialisation, but it is currently not accessible at a higher level in the Java space.

All this can be found in TextRecognizer.java, vision.cpp and tessocr.cpp.

So currently I cannot help you.
I will turn your question into a request bug and see whether it is possible to get at least the parameter handover to Tesseract into version 1.1.0, or failing that into a later 1.1.1.

Geoff Bache (geoff.bache) said :
#4

Thanks again for your prompt answers and engagement here. Look forward to a fix for the request you created.