Infinite loop when detecting spaces

Asked by Eugene S

Hi all,

In an attempt to create an alternative to the built-in Tesseract OCR, I thought about the following idea (high level):

1. Create a screenshot of each character (a screenshot for 'a', a screenshot for 'b', etc.)
2. Iterate over each character in a word and compare it to the collection of character screenshots. The one with a perfect match is the letter.

I know it might not be super efficient and/or quick, but as long as it provides consistent results, it's enough for me.

So the first challenge would be "segmentation" (character isolation). To do that, I thought to detect the spaces between letters, assuming that a single bar of empty space, 1 pixel wide and a couple of pixels high, acts as a separator. So I have created a pattern image which is basically a 1xN bar of white pixels.
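The idea in plain Python terms: scan the pixel columns and treat any column with no ink at all as a separator. A toy sketch (the pixel grid here is made up for illustration; Sikuli of course works on screen captures, not arrays):

```python
# A made-up 3x12 binary strip: 0 = white (background), 1 = ink.
# Columns 3 and 8 are entirely white, so they act as separators.
IMAGE = [
    [1, 1, 1, 0, 1, 1, 1, 1, 0, 1, 1, 1],
    [1, 0, 1, 0, 1, 0, 0, 1, 0, 1, 0, 1],
    [1, 1, 1, 0, 1, 1, 1, 1, 0, 1, 1, 1],
]

def gap_columns(image):
    """Return the x positions of columns that contain no ink at all."""
    width = len(image[0])
    return [x for x in range(width) if all(row[x] == 0 for row in image)]

print(gap_columns(IMAGE))  # -> [3, 8]
```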

As a next step, I created an image of a short string of plain text and ran the following algorithm to validate that the gaps between letters are detected correctly:

text = find("sampleText.png")  # a short string of text

for x in text.findAll("sampleTextSeparator.png"):  # a 1xN bar
    x.highlight(1)

However, it seems that instead of iterating over all the gaps in this text, the algorithm just finds and highlights the same gap each time. I tried to count the number of times the loop runs, and it's 100! (it should be 25, including the spaces between words).

Any ideas why such behavior might happen?

Cheers,
Eugene S

Question information

Language: English
Status: Answered
For: SikuliX
Assignee: No assignee
Revision history for this message
RaiMan (raimund-hocke) said :
#1

For this, a find() on the whole region is not appropriate, since you cannot control whether the next find() on the same region steps to the right.

So you either have to step through the text region yourself and check every bar, or try findAll().

One more thing:
I do not want to discourage you, but this will never be efficient:
- depending on the font used, there might not be a one-colored gap between characters (look at a magnified version of the text above "Can you help with this problem?") - the gaps vary and some are not pure white.
- a simple calculation:
-- searching for a gap takes 5 msecs
-- comparing one character takes 10 msecs
-- lowercase and uppercase characters plus digits and some special characters sum up to about 70
-- so the average character identification takes 35 compares - about 400 msecs minimum (including the gap search per character)
-- so reading a 5-character word will take a minimum of about 2 seconds (plus pre- and post-processing)
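The back-of-the-envelope numbers above can be checked directly (the 5 ms and 10 ms figures are estimates, not measurements):

```python
GAP_SEARCH_MS = 5        # estimated cost of finding one inter-character gap
COMPARE_MS = 10          # estimated cost of one template comparison
ALPHABET_SIZE = 70       # letters, digits, some special characters
AVG_COMPARES = ALPHABET_SIZE / 2  # on average, half the alphabet is tried

per_char_ms = GAP_SEARCH_MS + AVG_COMPARES * COMPARE_MS
word_ms = 5 * per_char_ms  # a 5-character word

print(per_char_ms)  # 355.0 -> "about 400 msecs minimum"
print(word_ms)      # 1775.0 -> roughly 2 seconds
```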

I do not think this makes sense.

A possible approach, though, might be to have one image of all possible characters, numbers and signs, plus a description (list, map, ...) that records which character is where in the row.
If you manage to isolate a character image in your text region (the one to do OCR on), you can take the capture of that character and search for it in the prepared alphabet image.
This should cut the time spent per character down to half, or even less.
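The alphabet-image idea then amounts to one search plus a position-to-character lookup. A hypothetical sketch (the cell width and layout are invented; in Sikuli the x offset would come from the match location inside the alphabet image):

```python
# Hypothetical layout: each character occupies a fixed-width cell in one row.
CELL_WIDTH = 8
ALPHABET = "abcdefghijklmnopqrstuvwxyz"

def char_at_offset(match_x):
    """Map the x offset of a match inside the alphabet image to a character."""
    return ALPHABET[match_x // CELL_WIDTH]

print(char_at_offset(0))   # 'a' (first cell: 0..7)
print(char_at_offset(17))  # 'c' (third cell: 16..23)
```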

But still not efficient.

I think, if you really want to do OCR with Sikuli, you should use the built-in Tesseract features, which work rather well with the bundled English tessdata set in version 1.0.1. For other languages, feel free to install the appropriate set. And finally, you can always use the Tesseract utilities to make it "learn" more fonts.

Revision history for this message
Eugene S (shragovich) said :
#2

Hi RaiMan and thank you for your answer.

Regarding the control over the direction of the step, I thought to solve this issue in the following way: find all gaps and memorize the location of each one of them. Then, when I have all the locations, I can map them onto the source image and thus derive the locations of the actual characters.
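Mapping memorized gap locations back onto the source image could look roughly like this (pure-Python sketch; the gap positions are invented, in practice they would come from the match locations):

```python
def char_spans(gap_xs, width):
    """Turn sorted gap x positions into (left, right) pixel spans of the characters."""
    edges = [-1] + sorted(gap_xs) + [width]
    spans = []
    for left, right in zip(edges, edges[1:]):
        if right - left > 1:  # skip adjacent gaps (wider spaces between words)
            spans.append((left + 1, right - 1))
    return spans

# Gaps at x = 3 and x = 8 in a 12-pixel-wide strip -> three character spans.
print(char_spans([3, 8], 12))  # [(0, 2), (4, 7), (9, 11)]
```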

Regarding the gaps: in my case, I can assume that there is only a single one-pixel-wide gap between each pair of letters.
I completely agree that the proposed algorithm is far from efficient; however, the ability to correctly and consistently recognize text would really make my automation easier.

Regarding the other algorithm you proposed: I think it is similar to the one I explained in my question. Or did I miss something? :)

I have put some effort into looking for information about Tesseract training, but I wasn't able to make any progress in that direction. From what I have found (please correct me if I am wrong), it seems that Tesseract is designed to recognize handwritten text rather than printed text. And I have really noticed that text in bolder/fatter fonts is recognized much more accurately (with Sikuli) than simple plain type (for example, like this one). It doesn't make any sense to me; I would have thought that printed text would be much easier to recognize.

Anyway, I know from your answers to other questions that you don't have much experience with Tesseract training, so I won't ask about it here. But I would like your opinion on another approach I have thought of.

I believe that if I could do some preprocessing on the text (before recognizing it), specifically make it bolder/bigger/fatter, the recognition would perform much better. So I thought about doing some basic image processing in Python. I think there are image processing tools provided by the SciPy/NumPy modules. However, I am not sure they will work smoothly with Jython.
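One caveat: NumPy and SciPy are built on C extensions and do not normally run under Jython, so a pure-Python fallback may be needed. As a toy stand-in for the "make it bigger" idea, here is a minimal nearest-neighbour 2x upscale on a binary pixel grid (real preprocessing would of course work on actual image data):

```python
def upscale2x(image):
    """Nearest-neighbour 2x upscale of a binary pixel grid."""
    out = []
    for row in image:
        doubled = [p for p in row for _ in range(2)]  # duplicate each column
        out.append(doubled)
        out.append(list(doubled))                     # duplicate each row
    return out

tiny = [[0, 1],
        [1, 0]]
print(upscale2x(tiny))
# [[0, 0, 1, 1], [0, 0, 1, 1], [1, 1, 0, 0], [1, 1, 0, 0]]
```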

Should I give that a try, or will it be a waste of time?

Thanks again!
Eugene S

Revision history for this message
RaiMan (raimund-hocke) said :
#3

I just made a test with this simple script:

import time

Settings.MoveMouseDelay = 0
img = "1386938448955.png"  # headline of this question: Infinite loop when detecting spaces
m = find(img)
m.highlight(1)             # briefly show the matched headline
nxt = m.left(1).right(1)   # 1-pixel-wide column at the left edge of the match
gap = capture(nxt)         # use that column as the gap template
start = time.time()
n = 0
for i in range(m.w):       # step right one pixel column at a time
    nxt = nxt.right(1)
    if nxt.exists(Pattern(gap).similar(0.99), 0):  # 0 = do not wait
        nxt.hover()
        n += 1
total = time.time() - start
print m.w, total/m.w, n, total, total/n

result with version 1.0.1:
493 0.264811359119 113 130.552000046 1.15532743403

result with the current development version of 1.1 (soon as Beta1):
493 0.0113164300125 122 5.57899999619 0.0457295081655

This is the first time I have found an example where the optimizations in 1.1 are this impressive: only 6 seconds versus 130 seconds in 1.0.1 --- wooow, about 20 times faster ;-)

So I think you might develop your approach and make it work accurately and robustly.
The wanted speed will come with 1.1 ;-)

*** Regarding the other algorithm you have proposed.. I think it is similar to the one I have explained in my question. Or did I miss something? :)

I think not really:
Suppose you have isolated the characters. Then, with your version, you have to compare each character with the captured symbols (about 70 of them, +/-). You might arrange them according to their frequency in your texts, but on average there will be around 20+ compares per character (20 searches in very small images).
If you take my approach, it is one search in an image containing all characters in defined places (which might be a bit faster on average).

BTW: you can optimize your approach by checking, for each character, whether it is uppercase or lowercase (check whether the top part of the space is empty) and whether it is a character with a descender (check whether the bottom part of the space is empty). This might reduce the average character recognition time, since you then only search within a specific group of characters.
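That pre-check can be sketched as a coarse classifier that narrows the candidate set before any template comparison (the character groups and the pixel grid below are invented for illustration):

```python
# Invented candidate groups, keyed by (top band, bottom band) ink presence.
GROUPS = {
    ("tall", "flat"):  "bdfhklt" + "ABCDEFG",  # ascenders / uppercase (sample)
    ("short", "flat"): "aceimnorsuvwxz",       # x-height only
    ("short", "deep"): "gpqy",                 # descenders
    ("tall", "deep"):  "j",                    # both, approximately
}

def candidates(glyph):
    """Pick a candidate group from the ink in the top and bottom pixel rows."""
    top = "tall" if any(glyph[0]) else "short"
    bottom = "deep" if any(glyph[-1]) else "flat"
    return GROUPS[(top, bottom)]

# A made-up 4-row glyph with ink in the top band but not the bottom band.
glyph = [[0, 1, 0],
         [1, 1, 1],
         [1, 0, 1],
         [0, 0, 0]]
print(candidates(glyph))  # the ascender/uppercase group
```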

Can you help with this problem?
