[HowTo] poor man's OCR with ImageMagick and tesseract
This is not really a question; I'm writing it down in case someone else needs it.
I'm primarily running Mac OS X 10.6 with Sikuli 10.2 (plus Win7 32-bit via Boot Camp on a MacBook Pro).
Since my first contact with Sikuli I have been hoping for OCR features to be integrated into it, but that has not happened so far.
Now I have found a solution that is sufficient for me and tides me over until Sikuli itself has a Region.getText() ;-)
While looking for a free OCR app for Mac, I found that most solutions are based on tesseract-ocr (http://
On the command line:
tesseract input.tif output
which leaves you with an output.txt containing the recognized text.
Since Sikuli works with PNG files, these images have to be converted to TIFF before tesseract can use them. I also read on the web that tesseract delivers the best results with grayscale input. My own tests showed that the images should have a resolution of at least 300 dpi to get acceptable results.
Since I did not want to invest in Python and/or Java programming at first (where it would have been possible to use the OpenCV that is available in Sikuli), I decided to use ImageMagick for the conversion:
On the command line:
convert input.png -resample 300 -colorspace Gray output.tif
so output.tif is then the input to tesseract.
Since at least two additional files are produced, I decided to work in the Sikuli temp directory and rely on the fact that at least the temporary PNGs are deleted when Sikuli quits.
People out there who live with shell scripting night and day: please don't LOL ;-) It works :-)))
This is the script that I call from inside a Sikuli script:
#!/bin/sh
export PATH=/opt/
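# convert the PNG to a 300 dpi grayscale TIFF for tesseract
convert "$1" -resample 300 -colorspace Gray "$1.tif"
# tesseract writes the recognized text to $1.txt
tesseract "$1.tif" "$1" 2>/dev/null
rm "$1.tif"
# overwrite the original PNG with the text file, then print it
mv "$1.txt" "$1"
cat "$1"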
Mac scripters: I know that it's possible to set the default path for new processes somewhere else.
The cat prints the text to stdout, which is how I get it back into Sikuli.
After the script ends, the temp directory only holds the original .png, which now contains the text instead of the image and will be deleted by Sikuli.
Note: there is no parameter yet to select the OCR language or the resolution.
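As a rough idea of how the missing language and resolution parameters could look, here is a minimal sketch in plain Python that only builds the two command lines. The function name build_ocr_commands and its dpi and lang parameters are made up for illustration; -l is tesseract's option for choosing the language, and the language code (e.g. "deu") has to match an installed language pack.

```python
def build_ocr_commands(png_path, dpi=300, lang=None):
    """Return the ImageMagick and tesseract argument lists for one image.

    Hypothetical helper: dpi and lang are illustrative parameters,
    not part of the shell script above.
    """
    tif_path = png_path + ".tif"
    # same conversion as in the shell script, with a configurable resolution
    convert_cmd = ["convert", png_path, "-resample", str(dpi),
                   "-colorspace", "Gray", tif_path]
    # tesseract appends ".txt" to the output base name by itself
    tesseract_cmd = ["tesseract", tif_path, png_path]
    if lang:
        # -l selects the OCR language
        tesseract_cmd += ["-l", lang]
    return convert_cmd, tesseract_cmd
```

The two lists could then be handed to something like subprocess.call(), one after the other; the cleanup steps (rm, mv, cat) from the shell script are left out here.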
Now in Sikuli I have my OCR feature in a def():
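def myOCR(reg = None, debug=False):
    import os
    # without a region, let the user select one interactively
    if not reg: i = capture()
    else:
        try:
            i = capture(reg.getX(), reg.getY(), reg.getW(), reg.getH())
        except:
            i = None
    if not i: exit(1)
    # run the shell script on the captured image; its stdout is the recognized text
    f = os.popen(
    lines = f.readlines()
    f.close()
    if debug:
        for x in lines: print x[:-1]
    return lines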
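For anyone who prefers subprocess over os.popen, the "run the script and read its stdout" step might be sketched like this. It is only an illustration, not the code from my def: the helper name run_and_read_lines is made up, and since the command string is not shown in full above, the script path in the usage comment is a pure placeholder.

```python
import subprocess
import sys

def run_and_read_lines(argv):
    """Run a command and return its stdout as a list of lines (newlines kept).

    Hypothetical helper; returns an empty list if the command fails.
    """
    proc = subprocess.Popen(argv, stdout=subprocess.PIPE,
                            stderr=subprocess.PIPE,
                            universal_newlines=True)
    out, _err = proc.communicate()
    if proc.returncode != 0:
        return []
    # keepends=True mirrors what f.readlines() returns
    return out.splitlines(True)

# usage (placeholder path, adapt to your own script):
# lines = run_and_read_lines(["/path/to/your/ocr-script.sh", image_path])
```

Passing the arguments as a list avoids quoting trouble when the image path contains spaces.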
Two possible uses:
--- based on a given region:
m = find(path-
text = myOCR(m)
--- the user selects the region (only useful for some preliminary tests)
text = myOCR()
Currently, you will not know which part of the screen was selected.
Setting debug to True additionally prints the text to the message area.
--- The return value is a list of lines that still end in \n. Empty lines (only \n) may be present.
A line may contain non-printable characters (I have not yet analysed this, but it may have to do with character sets (UTF, Unicode, ...)).
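If the trailing \n, the empty lines and the non-printable characters get in the way, a small clean-up step might look like the sketch below. The function name clean_ocr_lines is made up, and the sketch assumes a current Python 3 where the lines are already decoded text (str.isprintable() does not exist in the Jython that Sikuli runs, so it would need adapting there).

```python
def clean_ocr_lines(lines):
    """Strip line endings, drop non-printable characters and empty lines.

    Hypothetical helper for post-processing the list returned by an OCR call.
    """
    cleaned = []
    for line in lines:
        text = line.rstrip("\r\n")
        # keep only characters Python considers printable (spaces included)
        text = "".join(ch for ch in text if ch.isprintable())
        # skip lines that are empty or whitespace only
        if text.strip():
            cleaned.append(text)
    return cleaned
```

Note that this also removes tabs, since isprintable() treats them as non-printable.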
--- so what do you have to do if you want to use it:
- get ImageMagick and tesseract-ocr running (I used MacPorts and succeeded at once).
- adapt the shell script to your environment
- adapt the Sikuli def() to your situation (at least the path to the shell script)
- gather your own experience
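To check the first step before wiring everything together, something like the following could tell you whether the two tools are found on the PATH of the current process. This is a sketch for a current Python 3 (shutil.which is not available in older Python/Jython versions); the function name missing_tools is made up.

```python
import shutil

def missing_tools(names=("convert", "tesseract")):
    """Return the names of the tools that cannot be found on PATH."""
    return [name for name in names if shutil.which(name) is None]

# usage:
# print(missing_tools())   # an empty list means both tools were found
```

If a tool shows up as missing although it is installed, the PATH of the calling process is the likely culprit, which is why the shell script sets PATH explicitly.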
--- my first experiences
- even rather small text is recognized at a high rate
- the processing time is rather short (less than 0.2 seconds for 1 or 2 lines of text, which is close to the average of an optimized find())
- too many graphics and too many different fonts in the region may lower the recognition rate
- large regions (somewhere above 600 x 600) take some time and may return rubbish (I will not analyse this, since I doubt that it makes sense in a Sikuli script at all)
I have working examples (or have at least tested with myOCR()):
- read the title of the frontmost app window
- get the name of the frontmost app
- read the tab titles of all tabs in a browser window
- get the names of all running apps from the task window
- read the text from text boxes in pref panes
- read the text of buttons
- read text in pictures in iPhoto
- ...
If you try it, please share your experiences.