Need a good OCR solution for Linux.

Asked by komputes

I would like to explore what Linux has to offer in the domain of OCR (Optical Character Recognition), so I did some research and started blindly testing kooka, gocr and ocrad. I can't seem to get any of them to work. If anyone has experience with OCR, advice would be appreciated.

Question information

Language: English
Status: Solved
For: Ubuntu tesseract
Assignee: No assignee
Solved by: marcobra (Marco Braida)

This question was reopened

Revision history for this message
Álvaro del Olmo Alonso (dllum) said :
#1

I have experience with XSane. It comes with the default Ubuntu installation and works quite well with simple texts.
http://www.xsane.org/
Hope that helps.
Regards.

komputes (komputes) said :
#2

I am looking for an update for this question.

Alvaro, XSane is great if you have a scanner. XSane opens looking for a scanner and then closes when it does not find one. I need a program where I can feed in a picture, select a block of text and OCR it into editable text which I can copy to a text editor.

Launchpad Janitor (janitor) said :
#4

This question was expired because it remained in the 'Open' state without activity for the last 15 days.

komputes (komputes) said :
#5

Auto expired. Still need an answer.

Launchpad Janitor wrote:
> Your question #32518 on Ubuntu changed:
> https://answers.launchpad.net/ubuntu/+question/32518
>
> Status: Open => Expired
>
> Launchpad Janitor expired the question:
> This question was expired because it remained in the 'Open' state
> without activity for the last 15 days.

Launchpad Janitor (janitor) said :
#6

This question was expired because it remained in the 'Open' state without activity for the last 15 days.

komputes (komputes) said :
#7

Still requires solution.

Launchpad Janitor (janitor) said :
#8

This question was expired because it remained in the 'Open' state without activity for the last 15 days.

komputes (komputes) said :
#9

Answer still needed.

Launchpad Janitor (janitor) said :
#10

This question was expired because it remained in the 'Open' state without activity for the last 15 days.

komputes (komputes) said :
#11

Unanswered

Best answer: marcobra (Marco Braida) said :
#12

These are some open-source OCR programs for Linux:

ocrad
tesseract-ocr
gocr-tk

In System->Administration->Software Sources, on the Ubuntu Software tab, enable (check the checkbox for) all of the repositories listed.

Then open a Terminal from the menu Applications->Accessories->Terminal and type:

sudo aptitude update
sudo aptitude upgrade
sudo aptitude install ocrad tesseract-ocr gocr-tk

Give your user password when requested (nothing is shown on screen as you type it), then press Enter.
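
To confirm that the install worked, something like the following could be used. It is only a sketch: check_tools is a made-up helper name, and it only reports whether each command name is found on the PATH.

```shell
# check_tools NAME...: report whether each command name is on the PATH.
# (check_tools is a hypothetical helper, not part of any package.)
check_tools() {
    for tool in "$@"; do
        if command -v "$tool" >/dev/null 2>&1; then
            echo "$tool: found"
        else
            echo "$tool: missing"
        fi
    done
}
```

For example: check_tools ocrad tesseract gocr (assuming the packages above install commands with those names; tesseract-ocr installs its program as tesseract).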

Other lists of OCR software:
http://swik.net/OCR+Software
http://www.linux-ocr.ekitap.gen.tr/

Hope this helps

komputes (komputes) said :
#13

Thank you marcobra for the suggestions. I have heard good things about tesseract-ocr, so I thought I would give it a try. I finally got it working. It's not perfect, but it works better than anything else I have tried so far. I found that you should use a high-quality TIFF image when converting to text through OCR, or you are likely to run into spelling errors in the output.

Here's what I did:

In a shell:
$ sudo apt-get install tesseract-ocr tesseract-ocr-eng

Open a scan or an image in GIMP and crop the text out.

Image > Mode > Grayscale
Tools > Color Tools > Threshold (for good contrast)
Image > Mode > Indexed (to black & white, like a fax)
File > Save as JPEG

In a shell:
$ convert image.jpg image.tif
$ tesseract image.tif text

(tesseract adds .txt to the output name itself, so the result ends up in text.txt.)
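
For more than a couple of images, the GIMP steps above could probably be scripted. The sketch below is untested against real scans and rests on a few assumptions: ImageMagick's convert and tesseract are installed, the image has already been cropped to the text, and the 50% threshold (a guess) suits the image. ocr_image is just a name made up for this example.

```shell
# ocr_image FILE: grayscale + threshold an image, save it as TIFF, then OCR it.
# Prints the name of the text file that tesseract writes.
# Assumptions: ImageMagick `convert` and `tesseract` are on the PATH;
# 50% is only a starting value for the threshold and may need tuning.
ocr_image() {
    src=$1
    base=${src%.*}    # strip the last extension: scan.jpg -> scan

    # Same idea as Image > Mode > Grayscale followed by Threshold in GIMP.
    convert "$src" -colorspace Gray -threshold 50% "$base.tif" || return 1

    # tesseract appends ".txt" to the output name, so pass the base name only.
    tesseract "$base.tif" "$base" || return 1

    echo "$base.txt"
}
```

For example, ocr_image scan.jpg should leave the recognized text in scan.txt. If the result is poor, try a different threshold value first.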

komputes (komputes) said :
#14

Thanks marcobra, that solved my question.