Manually improving OCR

Asked by carl

I am new to the OCR world, but I would like to know if I can manually improve OCR.

In my first case the characters are all in a single font and size (like on a blog). If I take a sample image of each character, can I improve my results that way, and how? The OCR would then only have to compare images and overcome some background problems...

Question information

Language:
English
Status:
Solved
For:
SikuliX
Solved by:
Edmundo VN
RaiMan (raimund-hocke) said :
#1

In principle, yes ;-)

You have to dive into the training of Tesseract (see the Tesseract pages).

Training new fonts usually results in some additional traineddata and option files, which then have to be placed into SikuliX's tessdata folder. Currently it is only possible to switch between languages.

Come back when you have decided on a concept.

... and if this is not suitable for you: it is always possible to install Tesseract (for the training you have to do that anyway) and run the tesseract command from inside SikuliX.
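As a rough sketch of that last idea, a script could shell out to the tesseract command-line tool. The file names, output base, and language code below are placeholder assumptions; tesseract writes its result to `<output_base>.txt`:

```python
import shutil
import subprocess

def run_tesseract(image_path, output_base, lang="eng"):
    """Build (and, if tesseract is installed, run) a tesseract CLI call.

    image_path, output_base and lang are placeholders for illustration.
    """
    cmd = ["tesseract", image_path, output_base, "-l", lang]
    if shutil.which("tesseract") is None:
        # Tesseract not installed: return the command for inspection only.
        return cmd, None
    subprocess.run(cmd, check=True)
    with open(output_base + ".txt") as f:
        return cmd, f.read()

cmd, text = run_tesseract("screenshot.png", "out")
```

In a SikuliX script the same call would be made from Jython after saving a captured region to an image file.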

Best Edmundo VN (edmundo-vn) said :
#2

Some time ago I needed to do something similar to what you want, as I asked here: https://answers.launchpad.net/sikuli/+question/263287. Eugene suggested that I kind of make my own OCR. The result is what I answered last there; it takes 0.3 seconds to solve a phrase with 100% accuracy.

It's not simple. It was made to recognize the text of a combobox, not an entire page, and it's intolerant to changes (the text rendered on the screen of a specific configuration is different from another; for example, the font on my screen is different from the font in the exact same system inside VirtualBox).

I made a module, and a Fireworks image with character slices (very specific to a screenshot of my program) to ease the work of cutting and saving each character; each character image is named with its Unicode number plus .png. The code only takes into account how paths are composed on Linux, and it is written with comments and variable names in Brazilian Portuguese.

I did more or less what Eugene suggested: I process an area and try to find each character inside it, making a list of each occurrence of each character, saving the character and its center position, but without handling spaces at this stage. I sort the list by position and subtract the first position from the last to get the size of the text found; only inside that area do I search for spaces, add them to the list, and sort by center position again. Then I process a .py file inside a subdirectory named after the font it holds; this is a set of "rules" to apply to the list. For example, with a dialog font, r followed by n and r followed by m can have exactly the same partial picture depending on how I cut it, but an r cannot have a blank pixel at its right with that font.

(I work with the centers.)
If an r is found before an n with 1 pixel of difference, the r doesn't exist.
If an r is found before an m with 3 pixels of difference, the r doesn't exist.
(An n is never found inside an m because of the blank pixel at the right, so that rule is not needed.)

If you compare this with "region.right(xxx).text()" it seems overkill, but with my 11pt dialog font it has 100% accuracy. Unfortunately I don't know how to teach Tesseract to do that.
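The pipeline described above (match each character, sort by center, insert spaces where the gap is large, then apply per-font rules like the r/n and r/m ones) can be sketched in plain Python. The match list, the space-gap threshold, and the rule values here are illustrative assumptions; in a real SikuliX script the (center, character) pairs would come from findAll() over the per-character images:

```python
def apply_rules(matches, rules):
    """Drop false-positive characters.

    matches: list of (center_x, char) pairs from template matching.
    rules: maps (char, next_char) -> max center distance in pixels at
    which the first char is spurious, e.g. ('r', 'n'): 1 and
    ('r', 'm'): 3, as in the rules quoted above (assumed values).
    """
    matches = sorted(matches)
    keep = []
    for i, (x, ch) in enumerate(matches):
        spurious = False
        if i + 1 < len(matches):
            nx, nch = matches[i + 1]
            limit = rules.get((ch, nch))
            if limit is not None and nx - x <= limit:
                spurious = True
        if not spurious:
            keep.append((x, ch))
    return keep

def assemble_text(matches, space_gap=6):
    """Sort matches by center and insert a space wherever the gap
    between neighbouring centers exceeds space_gap pixels
    (space_gap is an illustrative threshold)."""
    matches = sorted(matches)
    chars = []
    for i, (x, ch) in enumerate(matches):
        if i > 0 and x - matches[i - 1][0] > space_gap:
            chars.append(" ")
        chars.append(ch)
    return "".join(chars)

# A fake match list standing in for findAll() results for "on me":
# the 'm' at x=14 also produced a spurious 'r' match 3 px before it.
raw = [(14, "m"), (0, "o"), (11, "r"), (18, "e"), (4, "n")]
filtered = apply_rules(raw, {("r", "n"): 1, ("r", "m"): 3})
text = assemble_text(filtered)
```

Here the spurious r is removed by the ('r', 'm') rule, and the 10-pixel gap between the n and the m becomes a space.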

Edmundo VN (edmundo-vn) said :
#4

Maybe it would be possible to apply a filter to the images (to make them black and white and raise the contrast) to deal with font smoothing and background differences, but I did not try that.
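That kind of filter could be sketched as a simple grayscale-plus-threshold pass. The version below works on plain nested lists of RGB tuples so it stays self-contained; a real script would more likely use Pillow or OpenCV on the captured screenshot, and the 128 cutoff is just an assumption:

```python
def binarize(pixels, threshold=128):
    """Convert rows of (r, g, b) pixels to pure black/white.

    Grayscale via the common luma weights, then threshold: anything
    brighter than `threshold` becomes white (255), everything else
    black (0). Anti-aliased edge pixels and tinted backgrounds both
    collapse to one of the two values, which is the point of the filter.
    """
    out = []
    for row in pixels:
        new_row = []
        for r, g, b in row:
            luma = 0.299 * r + 0.587 * g + 0.114 * b
            new_row.append(255 if luma > threshold else 0)
        out.append(new_row)
    return out

# One row: light-gray background, a dark glyph pixel, and a
# mid-gray anti-aliased edge pixel.
sample = [[(200, 200, 200), (40, 40, 40), (120, 120, 120)]]
result = binarize(sample)
```

With the assumed threshold the background stays white while both the glyph pixel and the smoothed edge pixel go black, removing the font-smoothing gradient before matching.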

carl (maibannato) said :
#5

Thanks Edmundo V. Neto, that solved my question.