Manually improving OCR

Asked by carl

I am new to the OCR world, but I would like to know if I can manually improve OCR.

In my first case the characters are all in a single font and size (like on a blog). If I take a sample image of each character, can I improve my results that way, and how? The OCR would then only have to compare images and overcome some background problems...

Question information

Language:
English
Status:
Solved
For:
SikuliX
Solved by:
Edmundo VN
RaiMan (raimund-hocke) said :
#1

In principle, yes ;-)

You have to dive into the training of Tesseract (see the Tesseract pages).

Training new fonts usually results in some additional traineddata and option files, which then have to be placed into SikuliX's tessdata folder. Currently it is only possible to switch between languages.

Come back when you have decided on a concept.

... and if this is not suitable for you: it is always possible to install Tesseract (for the training you have to do that anyway) and run the tesseract command from inside SikuliX.
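As a rough sketch of that last idea, a script could shell out to the tesseract command-line tool. The file names, output base, and language code below are placeholder assumptions; tesseract writes its result to `<output_base>.txt`:

```python
import shutil
import subprocess

def run_tesseract(image_path, output_base, lang="eng"):
    """Build (and, if tesseract is installed, run) a tesseract CLI call.

    image_path, output_base and lang are placeholders for illustration.
    """
    cmd = ["tesseract", image_path, output_base, "-l", lang]
    if shutil.which("tesseract") is None:
        # Tesseract not installed: return the command for inspection only.
        return cmd, None
    subprocess.run(cmd, check=True)
    with open(output_base + ".txt") as f:
        return cmd, f.read()

cmd, text = run_tesseract("screenshot.png", "out")
```

In a SikuliX script the same call would be made from Jython after saving a captured region to an image file.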

Best Edmundo VN (edmundo-vn) said :
#2

Some time ago I needed to do something similar to what you want, as I asked here: https://answers.launchpad.net/sikuli/+question/263287. Eugene suggested that I kind of make my own OCR. The result is what I answered last there; it takes 0.3 seconds to solve a phrase with 100% accuracy.

It's not simple. It was made to recognize the text of a combobox, not an entire page, and it's intolerant to changes (the text rendered on the screen of a specific configuration is different from another; for example, the font on my screen is different from the font in the exact same system inside VirtualBox).

I made a module, and a Fireworks image with character slices (very specific to a screenshot of my program) to ease the work of cutting and saving each character; each character image is named with its Unicode number plus .png. The code only takes into account how paths are composed on Linux, and it is written with comments and variable names in Brazilian Portuguese.

I did more or less what Eugene suggested: I process an area and try to find each character inside it, making a list of each occurrence of each character, saving the character and its center position, but without handling spaces at this stage. I sort the list by position and subtract the first position from the last to get the size of the text found; only inside that area do I search for spaces, add them to the list, and sort by center position again. Then I process a .py file inside a subdirectory named after the font it holds; this is a set of "rules" to apply to the list. For example, with a dialog font, r followed by n and r followed by m can have exactly the same partial picture depending on how I cut it, but an r cannot have a blank pixel at its right with that font.

(I work with the centers.)
If an r is found before an n with 1 pixel of difference, the r doesn't exist.
If an r is found before an m with 3 pixels of difference, the r doesn't exist.
(An n is never found inside an m because of the blank pixel at the right, so that rule is not needed.)

If you compare this with "region.right(xxx).text()" it seems overkill, but with my 11pt dialog font it has 100% accuracy. Unfortunately I don't know how to teach Tesseract to do that.
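The pipeline described above (match each character, sort by center, insert spaces where the gap is large, then apply per-font rules like the r/n and r/m ones) can be sketched in plain Python. The match list, the space-gap threshold, and the rule values here are illustrative assumptions; in a real SikuliX script the (center, character) pairs would come from findAll() over the per-character images:

```python
def apply_rules(matches, rules):
    """Drop false-positive characters.

    matches: list of (center_x, char) pairs from template matching.
    rules: maps (char, next_char) -> max center distance in pixels at
    which the first char is spurious, e.g. ('r', 'n'): 1 and
    ('r', 'm'): 3, as in the rules quoted above (assumed values).
    """
    matches = sorted(matches)
    keep = []
    for i, (x, ch) in enumerate(matches):
        spurious = False
        if i + 1 < len(matches):
            nx, nch = matches[i + 1]
            limit = rules.get((ch, nch))
            if limit is not None and nx - x <= limit:
                spurious = True
        if not spurious:
            keep.append((x, ch))
    return keep

def assemble_text(matches, space_gap=6):
    """Sort matches by center and insert a space wherever the gap
    between neighbouring centers exceeds space_gap pixels
    (space_gap is an illustrative threshold)."""
    matches = sorted(matches)
    chars = []
    for i, (x, ch) in enumerate(matches):
        if i > 0 and x - matches[i - 1][0] > space_gap:
            chars.append(" ")
        chars.append(ch)
    return "".join(chars)

# A fake match list standing in for findAll() results for "on me":
# the 'm' at x=14 also produced a spurious 'r' match 3 px before it.
raw = [(14, "m"), (0, "o"), (11, "r"), (18, "e"), (4, "n")]
filtered = apply_rules(raw, {("r", "n"): 1, ("r", "m"): 3})
text = assemble_text(filtered)
```

Here the spurious r is removed by the ('r', 'm') rule, and the 10-pixel gap between the n and the m becomes a space.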

Edmundo VN (edmundo-vn) said :
#4

Maybe it would be possible to apply a filter to the images (to make them black and white and raise the contrast) to deal with font smoothing and background differences, but I did not try that.
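That kind of filter could be sketched as a simple grayscale-plus-threshold pass. The version below works on plain nested lists of RGB tuples so it stays self-contained; a real script would more likely use Pillow or OpenCV on the captured screenshot, and the 128 cutoff is just an assumption:

```python
def binarize(pixels, threshold=128):
    """Convert rows of (r, g, b) pixels to pure black/white.

    Grayscale via the common luma weights, then threshold: anything
    brighter than `threshold` becomes white (255), everything else
    black (0). Anti-aliased edge pixels and tinted backgrounds both
    collapse to one of the two values, which is the point of the filter.
    """
    out = []
    for row in pixels:
        new_row = []
        for r, g, b in row:
            luma = 0.299 * r + 0.587 * g + 0.114 * b
            new_row.append(255 if luma > threshold else 0)
        out.append(new_row)
    return out

# One row: light-gray background, a dark glyph pixel, and a
# mid-gray anti-aliased edge pixel.
sample = [[(200, 200, 200), (40, 40, 40), (120, 120, 120)]]
result = binarize(sample)
```

With the assumed threshold the background stays white while both the glyph pixel and the smoothed edge pixel go black, removing the font-smoothing gradient before matching.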

carl (maibannato) said :
#5

Thanks Edmundo V. Neto, that solved my question.