xsane2tess problem

Asked by Richard Wilmot

The trouble in post 136345 is still here. Xsane works fine by itself. It also works with gocr to do OCR, but gocr doesn't give good results; tesseract is much better. But xsane2tess doesn't work for me. The problem seems to be that the .tif file produced by imagemagick isn't recognised by tesseract. Image Viewer doesn't recognise it either, though Document Viewer and Gimp do. The scanner is an HP Photosmart C3180 all-in-one and it produces a PGM image file. For the scan I've been using, the scanned file is 2.0MB and the TIF file is 22MB. Perhaps one of the options available in imagemagick will produce a recognisable file, but which? I've tried removing the '-compress none' option, to no avail. Thanks for your help.

Question information

Language: English
Status: Answered
For: Ubuntu xsane
Assignee: No assignee
Richard Wilmot (richardglobal) said :
#1

The image type of the .tif image given under 'properties' is PNM if this helps.

mycae (mycae) said :
#2

Hello again,

PNM is not a TIFF format; it is a Portable aNyMap file, which is different:
https://secure.wikimedia.org/wikipedia/en/wiki/Portable_anymap

Imagemagick is not writing a tiff in this case. Gimp is probably recognizing the file from its contents and thus it can load it anyway.

I suspect that there is something wrong with the arguments at this line:

# converting image into TIFF (ImageMagick)
convert "$FILE_PATH" -compress none "$TIF_FILE" 1>&2

Can you provide the shell trace? (bash -x ./xsane2tess ; or wherever xsane2tess is located). This should show the arguments provided. You can then check the type of $FILE_PATH and $TIF_FILE afterwards.

For example

convert tmp.jpg tmp.tif

then

file tmp.tif

outputs
tmp.tif: TIFF image data, little-endian
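
The same kind of check can be sketched without `file` by reading the magic number at the start of the image. This is only an illustration: the helper name `image_kind` and the sample file name are made up here, and the sample is a hand-written 2x2 PGM, not real scanner output.

```shell
# Hypothetical helper: guess the image format from the first two bytes.
image_kind() {
  case "$(head -c 2 "$1")" in
    P1|P4) echo "PBM" ;;      # portable bitmap (ASCII / binary)
    P2|P5) echo "PGM" ;;      # portable greymap (ASCII / binary)
    P3|P6) echo "PPM" ;;      # portable pixmap (ASCII / binary)
    II|MM) echo "TIFF" ;;     # little-endian / big-endian TIFF
    *)     echo "unknown" ;;
  esac
}

# Write a tiny binary PGM under a name with no helpful extension.
tmpdir=$(mktemp -d)
sample="$tmpdir/scan.ppm-1000-example"
printf 'P5\n2 2\n255\n\000\100\200\377' > "$sample"

kind=$(image_kind "$sample")
echo "$kind"
rm -rf "$tmpdir"
```

Running `image_kind` on both $FILE_PATH and $TIF_FILE would show whether the .tif really starts with a TIFF header or is still a PNM file under a different name.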

Richard Wilmot (richardglobal) said :
#3

Hi mycae
Thanks for your continued help. On starting up this morning I tried xsane2tess and it worked! But only once - now we're back to normal.
If I use a jpeg file as the starting point, convert works normally, producing a .tif file of about the same size as the original and tesseract converts this (and Image Viewer likes it, too). If I use the file produced from the scanner via xsane I get (I've shortened the extremely long filename!):

convert xsaneorig.ppm-1000-mvI5gb xsaneorig.tif
convert: unable to open image `xsaneorig.ppm-1000-mvI5gb': @ error/blob.c/OpenBlob/2498.
convert: no decode delegate for this image format `xsaneorig.ppm-1000-mvI5gb' @ error/constitute.c/ReadImage/532

It looks as if xsane is producing a file format that imagemagick doesn't like.

The one time xsane2tess worked, the .tif file was 253kB, about the same size as the original. When it doesn't work, it's nearly 100 times larger (22.6MB). One thing: the original file should be about 250kB (about 2M pixels x 1 bit/pixel) but is 2MB on the disc.
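
The size arithmetic in this post can be written out as below. The 8 bit/pixel case is an assumption added here, not something stated in the thread: a binary PGM with a maximum grey value of 255 or less stores one byte per pixel, which would be consistent with a 2MB file for about 2M pixels. It does not explain the 22.6MB .tif.

```shell
# Raw image size in kB for a given pixel count and bit depth (header ignored).
raw_size_kb() {
  echo $(( $1 * $2 / 8 / 1000 ))
}

pixels=2000000   # "about 2M pixel", as quoted in the post

echo "1 bit/pixel: $(raw_size_kb "$pixels" 1) kB"
echo "8 bit/pixel: $(raw_size_kb "$pixels" 8) kB"
```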

mycae (mycae) said :
#4

Try renaming the file to xsaneorig.ppm, rather than xsaneorig.ppm-1000-mvI5gb; imagemagick may need a bit of a helping hand to identify the file.

You can also take the single file aside, run the imagemagick commands on it yourself, and check the output.

Richard Wilmot (richardglobal) said :
#5

Tried it: exactly the same result.
convert: unable to open image `xsaneorig.ppm': @ error/blob.c/OpenBlob/2498.
The second line of the original message (...no decode delegate....) isn't there.

Richard Wilmot (richardglobal) said :
#6

As it was a PGM file I changed the extension to .pgm and tried again, but got the same result.

mycae (mycae) said :
#7

Can you upload that file?

Richard Wilmot (richardglobal) said :
#8

Certainly. I assume you mean the image file. Where do I upload it to? As
an attachment to an email like this or what (I'm new to this!). Thanks
for all your help.

On 18/12/10 18:43, mycae wrote:
> Your question #138107 on xsane in ubuntu changed:
> https://answers.launchpad.net/ubuntu/+source/xsane/+question/138107
>
> Status: Open => Answered
>
> mycae proposed the following answer:
> Can you upload that file?
>

Richard Wilmot (richardglobal) said :
#9

Using the image file with xsane2tess within xsane gives the same result: a huge .tif file not recognised by tesseract. If I copy it into my home folder and run a test file containing:

convert /home/richard/xsane-conversion-hpaio:_usb_Photosmart__C3100__series_serial=MY6BGC43N904P9.ppm-1000-lyBush /home/richard/xsaneorig.tif
tesseract -i /home/richard/xsaneorig.tif -o /home/richard/out.txt

then it converts to a proper .tif file (correct size) but produces the result:

+ convert /home/richard/xsane-conversion-hpaio:_usb_Photosmart__C3100__series_serial=MY6BGC43N904P9.ppm-1000-lyBush /home/richard/xsaneorig.tif
+ tesseract -i /home/richard/xsaneorig.tif -o /home/richard/out.txt
read_variables_file:Can't open /usr/share/tesseract-ocr/tessdata/configs//home/richard/out.txtCould not open file, -o

Adding the '-compress none' makes no difference (even to the size of the .tif file). Under 'properties' the raw file is given as PGM file and under 'properties - image' the .tif file says 'failed to load image information'. Am I doing something really silly here?

mycae (mycae) said :
#10

Well, the tesseract command looks incorrect.

Looking at the manpage for tesseract (man tesseract -- synopsis section), the correct usage would be

tesseract /home/richard/xsaneorig.tif /home/richard/out.txt

(Note the -i and -o are not correct, and should be dropped).
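
One detail of that synopsis: in the tesseract versions that use the `tesseract imagename outputbase` form, the second argument is an output base name and tesseract appends .txt itself, so the command above would be expected to write /home/richard/out.txt.txt. A small sketch of deriving the base name follows; `outputbase_for` is a hypothetical helper, and the tesseract call is shown only as a comment because it depends on what is installed.

```shell
# Hypothetical helper: strip a trailing .txt so tesseract can add it back.
outputbase_for() {
  printf '%s\n' "${1%.txt}"
}

wanted=/home/richard/out.txt
outbase=$(outputbase_for "$wanted")
echo "$outbase"

# Intended use (assumes tesseract is installed and appends .txt to the base):
#   tesseract /home/richard/xsaneorig.tif "$outbase"
```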

Richard Wilmot (richardglobal) said :
#11

Thanks - very stupid of me! When I run my now corrected test file on the
image file it works fine, but still xsane2tess doesn't work in xsane. In
the xsane2tess code what's the 1>&2 bit at the end of the commands for?

mycae (mycae) said :
#12

When programs run in a terminal, they can generate two kinds of output -- "standard output" and "standard error" (stdout and stderr).

You can redirect standard output to a file like this:

echo "hello world" > file.txt

instead of spitting out "hello world" to the console, this will be saved to file.txt.

But if the program spat out errors (what is an error is up to the author of the program - they can use either stream at will), then you will not capture these using the above, and the errors will be written to the terminal.

You can capture errors using the 2> notation (pretend error is a fictional program that spits out an error message)

error "some error occurred" 2> errors.txt

in this case the error would be saved, but normal text would be printed to the screen.

The 1>&2 sends standard output to wherever standard error is currently going, so both end up in one single stream (stderr). Redirecting using 1> would just redirect standard output to a file, not error messages.

http://www.gnu.org/software/bash/manual/bashref.html#Redirections
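
A minimal sketch of the three cases described above. The function `emit` is a made-up stand-in for a program that writes one line to each stream, and the file names are arbitrary.

```shell
# Hypothetical stand-in for a program that uses both streams.
emit() {
  echo "to stdout"
  echo "to stderr" >&2
}

tmpdir=$(mktemp -d)

# Separate files for each stream.
emit > "$tmpdir/out.txt" 2> "$tmpdir/err.txt"

# 1>&2 points stdout at stderr; the group's stderr is the file,
# so both lines land in both.txt.
{ emit 1>&2; } 2> "$tmpdir/both.txt"

# With 1>&2 nothing is left on stdout, so this capture is empty.
captured=$( { emit 1>&2; } 2> /dev/null )

out_txt=$(cat "$tmpdir/out.txt")
err_txt=$(cat "$tmpdir/err.txt")
both_lines=$(wc -l < "$tmpdir/both.txt" | tr -d ' ')

echo "out.txt: $out_txt"
echo "err.txt: $err_txt"
echo "both.txt has $both_lines lines; captured stdout is '$captured'"
rm -rf "$tmpdir"
```

Note that the order of redirections matters in the shell: `2> file 1>&2` sends both streams to the file, whereas `1>&2 2> file` sends stdout to the original stderr and only stderr to the file.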

Richard Wilmot (richardglobal) said :
#13

Thanks. I'm not used to these scripts so only vaguely understand these details. It gets us no nearer finding out why imagemagick and tesseract work OK on the file outside xsane2tess, but not inside it. I suppose I'll never know and can only use tesseract in OCRfeeder. The HP software that came with the printer included a version of the IRIS OCR software and it works very well indeed (even better than tesseract!) but, of course, there's no Linux version. Ho hum!! Thanks again, mycae, for all your very patient help. Unless you have a great inspiration, please don't waste any more time on this.
