OCR parameters in Java

Asked by cv

Hi,

I am testing Sikuli (with the Java API) and so far I've been pretty impressed with what I have seen. I still have two questions about OCR:

1- I saw that the library used under the hood is Tesseract, and that it needs training data for OCR, which should be put in the tessdata folder. Is there a (short) explanation of what that data is and what results we can expect with the data provided? By the way, I found another data package on the web that seems to give me better results. Is there an online collection of font data that I can use in most cases?

2- I want to use OCR for automated tests in a game engine where we have raster fonts of variable size (and quality) on a background that is not pure white. To be honest, it does not work at all in the game engine, but it works quite well on the Windows desktop and in other software. So are there any parameters that I can tweak through the Java API to lower the threshold and get more matches? I don't have many text fields in my "game", so there is little chance of getting a lot of false positives even with a lower threshold.

Thanks in advance

Question information

Language: English
Status: Solved
For: SikuliX
Assignee: No assignee
Solved by: RaiMan

RaiMan (raimund-hocke) said :
#1

- About the tessdata folder: you have to look into the Tesseract documentation.

- Under the hood, Sikuli uses Tesseract 3.0.2 with the standard English language tessdata pack.

- If you use the Region.text() and Region.find("some text") features, you have to live with the poor quality in some usage situations (light font on dark background, unusual fonts, numbers, ... many known problems).

- If you look into the bugs and Q&As here, you will find a lot about problems and even some solutions.

- The text feature will only be improved (meaning totally rewritten on the basis of Tess4J) with the next version 1.2 (first betas might be available towards the end of this year).

I am sorry that I cannot give you better news.

RaiMan (raimund-hocke) said :
#2

BTW: could you tell me more about:
.... another data package on the web that seems to give me better result ....

cv (camille-viot-external) said :
#3

Thanks for the answer. Even if it doesn't solve my problem, it answers many questions I had.

About "If you look into the bugs and Q&As here, you will find a lot about problems and even some solutions.": could you tell me a little more, please? I did not find anything really handy here, but perhaps you have something particular in mind that I missed.

As for the data package, I might be wrong. I am not sure it was better (I might not have run the exact same test with the same package), but I did a fresh install and lost it. Sorry.

Best RaiMan (raimund-hocke) said :
#4

-- but I did a fresh install and lost it
No problem, I was only curious.
Until now, all freely available OCR solutions somehow rely on Tesseract.
So it might have been that there was something I did not know yet.

-- I did not find anything really handy
… OK, agreed. It mainly shows an overall picture of desperation with Sikuli's OCR feature ;-)

The only thing of value might be:
you might add some training data to the Tesseract folder as a different language pack.
To select this pack and make sure it is used:
see https://answers.launchpad.net/sikuli/+question/250795

Another thing is:
you might try to improve your images for Sikuli's OCR using any Java graphics features or any available graphics library.
With the latest 1.1.0 pre-release you might then use Image.text() to read the text, equivalent to Region.text().
For details you have to look into the javadocs for now, but generally:
BufferedImage better; // an image you have already improved somehow
Image imgBetter = Image.create(better);
String text = imgBetter.text();
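
To make the "improve your images" idea a bit more concrete, here is a minimal sketch that uses only standard JDK classes. It is not part of Sikuli: the class name OcrPreprocess, the method prepare and the scale parameter are made up for illustration. It upscales the capture, converts it to luminance, and binarizes it to black text on a white background. It assumes that text covers fewer pixels than the background, which may not hold for every game screen, and whether it actually helps Tesseract with a given raster font has to be tried out.

```java
import java.awt.Graphics2D;
import java.awt.RenderingHints;
import java.awt.image.BufferedImage;

// Hypothetical helper, not a Sikuli class.
final class OcrPreprocess {

    /** Upscales src by an integer factor and binarizes it to black text on white. */
    static BufferedImage prepare(BufferedImage src, int scale) {
        int w = src.getWidth() * scale;
        int h = src.getHeight() * scale;

        // Upscale first: small raster fonts are often easier to read when enlarged.
        BufferedImage scaled = new BufferedImage(w, h, BufferedImage.TYPE_INT_RGB);
        Graphics2D g = scaled.createGraphics();
        g.setRenderingHint(RenderingHints.KEY_INTERPOLATION,
                RenderingHints.VALUE_INTERPOLATION_BICUBIC);
        g.drawImage(src, 0, 0, w, h, null);
        g.dispose();

        // Compute the luminance of every pixel and the mean luminance.
        int[] lum = new int[w * h];
        long sum = 0;
        for (int y = 0; y < h; y++) {
            for (int x = 0; x < w; x++) {
                int rgb = scaled.getRGB(x, y);
                int r = (rgb >> 16) & 0xFF;
                int gr = (rgb >> 8) & 0xFF;
                int b = rgb & 0xFF;
                int l = (299 * r + 587 * gr + 114 * b) / 1000;
                lum[y * w + x] = l;
                sum += l;
            }
        }
        int mean = (int) (sum / lum.length);

        // Assumption: text is the minority of pixels. If the bright pixels are
        // the minority, the text is light on dark and has to be inverted.
        int brighter = 0;
        for (int l : lum) {
            if (l > mean) {
                brighter++;
            }
        }
        boolean lightText = brighter * 2 < lum.length;

        // Write text pixels as black and everything else as white.
        BufferedImage out = new BufferedImage(w, h, BufferedImage.TYPE_INT_RGB);
        for (int y = 0; y < h; y++) {
            for (int x = 0; x < w; x++) {
                boolean bright = lum[y * w + x] > mean;
                boolean isText = lightText ? bright : !bright;
                out.setRGB(x, y, isText ? 0x000000 : 0xFFFFFF);
            }
        }
        return out;
    }
}
```

The result of OcrPreprocess.prepare(capture, 2) would then take the place of "better" in the lines above. Using the mean as threshold is a deliberately simple choice. A per-region threshold may work better on backgrounds with gradients.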

cv (camille-viot-external) said :
#5

Thanks RaiMan, that solved my question.

JonyGreen (jonygreen) said :
#6

I'm not a developer; I always use the free online OCR at http://www.online-code.net/ocr.html to recognize and scan text from images.