How-to: Kubuntu Hardy 8.04

Asked by barnacle on 2008-06-30

For information: from a clean installation of Kubuntu 8.04 (Hardy Heron), Cuneiform for Linux can be compiled if you:

sudo apt-get install g++-4.2
sudo apt-get install cpp-4.2
sudo apt-get install build-essential
sudo apt-get install gmake

(or use Adept)

and then follow the instructions in the readme.txt file. This will produce *hundreds* of warnings, mostly about deprecated conversions from strings to character pointers or about comparisons that always fail. (I haven't yet found a sane way to log the warnings, since they come from the compiler and not from 'make'.)
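On the logging problem: compiler warnings are written to standard error, which a plain `>` redirect does not capture. A minimal sketch, assuming the build is started with `make` as in the post (substitute whatever command readme.txt gives); the function names and the `build.log` file name are made up for illustration:

```shell
# Run the build, sending stderr (where the compiler writes warnings)
# through the same pipe as stdout, and keep a copy in build.log.
build_with_log() {
    make "$@" 2>&1 | tee build.log
}

# Count the lines in a saved log that contain a GCC-style "warning:".
count_warnings() {
    grep -c 'warning:' "$1"
}

# Group warnings by message text, most frequent first.
# Everything up to and including "warning: " (the file:line prefix) is dropped.
summarise_warnings() {
    grep 'warning:' "$1" | sed 's/^.*warning: //' | sort | uniq -c | sort -rn
}
```

After `build_with_log`, `count_warnings build.log` gives the total and `summarise_warnings build.log` shows which kinds of warning dominate.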

Output from a GIMP-generated test file, an 8-bit greyscale bitmap (scale equivalent to about 400dpi):
~~~~~~~~~~~~~~~~~~~~~
Father Tim Shannon finished
combing his thinning red hair, put
down his steel pocket comb, and
took � good long look at himself n
the mirror. Although proud of his
Irish ancestry, he was glad he
wasn' t green any more. The two
sodas he' d sucked down had
quelled the worst of the rebellion
in his stomach. The Neoval and G-
Right he' d washed down with the
first one hadn' t hurt much either,
like � couple extra Hail Marys for
good measure.
"Admit it, Tim," he told his
reflection. "You weren' t cut out
for space travel." That left
unspoken the question of what
he was cut out for.
Not this, that was for sure. When
~~~~~~~~~~~~~~~~~~~~

Original text (from tesseract on the .tif version of the same text):
~~~~~~~~~~~~~~~~~~~~
Father Tim Shannon finished
combing his thinning red hair, put
down his steel pocket comb, and
took a good long look at himself in
the mirror. Although proud of his
Irish ancestry, he was glad he
wasn't green any more. The two
sodas he'd sucked down had
quelled the worst of the rebellion
in his stomach. The Neoval and G-
Right he'd washed down with the
first one hadn't hurt much either,
like a couple extra Hail Marys for
good measure.
"Admit it, Tim," he told his
reflection. "You weren't cut out
for space travel." That left
unspoken the question of what
he was cut out for.
Not this, that was for sure. When
~~~~~~~~~~~~~~~~~~~

On a larger image from an actual scan (300dpi, Analog magazine, March 1993, page 22, clipped to one column), it segfaulted. The page had a mix of italic and normal text, and had previously been OCR'd by a paid-for version of Cuneiform for Windows.

Neil

Question information

Language: English
Status: Solved
For: Cuneiform for Linux
Assignee: No assignee
Last query: 2008-07-01
Last reply: 2008-07-23

barnacle (nailed-barnacle) said : #1

Further update, on a scan from the same magazine, again clipped to increasing sizes:

one line: 638*60 pixels, ok
one paragraph: 624*264 pixels, ok
one third of column: 624*724 pixels, ok
one half of column: 656*1212 pixels, ok
full column: 656*2132 - segfault
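
The clipping experiment above can be scripted. A rough sketch: ImageMagick's `convert` does the cropping, and on Linux a shell reports a process killed by SIGSEGV as exit status 139 (128 + 11). `OCR_CMD` and the file names are placeholders, not Cuneiform's actual command line:

```shell
# Run a command quietly and print ok, segfault, or failed(N).
run_and_classify() {
    "$@" >/dev/null 2>&1
    status=$?
    case "$status" in
        0)   echo ok ;;
        139) echo segfault ;;      # 128 + SIGSEGV (11)
        *)   echo "failed($status)" ;;
    esac
}

# Crop the top H rows of IMAGE at each height and OCR the result.
# Usage: OCR_CMD=/path/to/ocr-binary crop_and_test column.tif 656 60 264 724
crop_and_test() {
    image=$1; width=$2; shift 2
    for h in "$@"; do
        convert "$image" -crop "${width}x${h}+0+0" +repage "crop_${h}.tif"
        echo "${h}: $(run_and_classify "$OCR_CMD" "crop_${h}.tif")"
    done
}
```

Stepping the height between the last size that works and the first that crashes should narrow down where the failure starts.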

Neil

Jussi Pakkanen (jpakkane) said : #2

These issues are explained in readme.txt and the release notes. Briefly:

- yes, there are hundreds of warnings, in fact thousands; that's what you get when your code base is 15+ years old
- multicolumn mode is disabled

I don't know why large images crash Cuneiform, but I'm working on it.

barnacle (nailed-barnacle) said : #3

Indeed, I read the notes and wasn't surprised. It was just for information if anyone else should try to build it. I have nothing but admiration for your efforts, Jussi...

I have discovered a quite small image which will crash it - again, it's a single column clipped from a real scan, and has italic text as well as roman. I had a brief look at tracing the code to see how far it got before it crashed, but I haven't yet located all the files. Gets as far as Layout()... :) - do you have perhaps a list of which routine is in which file?

Oddly, on the whole this suits me pretty well as it is - in that I need an OCR reader I can send a single line of text to; I want English only, perhaps with markup for italic (though I can cope with that outwith the OCR section, I think) as a text output. I *don't* need font ID or positional or context information - but I am probably an outlier with very specialist needs.

Looking at some of the warnings, a lot seem to relate to routines where multi-character - presumably Russian - characters are used. I can't see how these end up in the text at all... Kate on Kubuntu just shows them as the default question mark in a diamond, which isn't helpful. But these are the same characters that end up in the finished text, so I can only conclude that it might be worth investigating further - my thought is that some of the bugs may be because the Cyrillic character tests are failing (or passing) by default.
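
To see exactly where the odd characters land in an output file, it can help to list the lines that contain bytes outside printable ASCII. A small sketch using GNU grep; the function names are made up for illustration:

```shell
# Print (with line numbers) every line containing a byte outside
# printable ASCII (0x20-0x7E). Tabs also match. LC_ALL=C makes grep
# work byte by byte; -a stops it treating the file as binary.
find_non_ascii() {
    LC_ALL=C grep -a -n '[^ -~]' "$1"
}

# Count such lines.
count_non_ascii() {
    LC_ALL=C grep -a -c '[^ -~]' "$1"
}
```

Piping the result of `find_non_ascii` through `od -c` or `hexdump -C` shows the raw byte values, which is usually enough to guess the encoding.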

I've taken the liberty of modifying your Pumatest to provide an output text name (or in its absence, the input image with .txt appended) - do you want it?
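
The naming rule described above, written out as a shell function purely for illustration; it is not the actual Pumatest change, and `output_name` is a made-up name:

```shell
# Print the output file name: the second argument if given,
# otherwise the input image name with .txt appended.
output_name() {
    echo "${2:-$1.txt}"
}
```

So `output_name scan.tif` gives the default name, while `output_name scan.tif page.txt` keeps the name the caller chose.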

Neil

Jussi Pakkanen (jpakkane) said : #4

The question mark thingy is in the readme as well. They are in Windows' English-Russian encoding whose name escapes me now (CP1251 or something).

Cuneiform seems to use some internal 8 byte encoding, which gets transformed on output to something.

Patches are welcome, but at the moment I am overhauling the build system/symbol hiding thing to work better (that is to say, at all). I'll see it once I get that done.

Jussi Pakkanen (jpakkane) said : #5

And by 8 bytes I mean 8 bits.
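
If the output really is in the Windows Cyrillic code page (CP1251; an assumption, since the thread only half-remembers the name), `iconv` can convert it to UTF-8 so that editors display the characters instead of placeholder diamonds. A small sketch; the function name and file names are illustrative:

```shell
# Convert text from the Windows Cyrillic code page (CP1251) to UTF-8.
# Reads stdin, writes stdout.
cp1251_to_utf8() {
    iconv -f CP1251 -t UTF-8
}
```

Typical use would be `cp1251_to_utf8 < output.txt > output.utf8.txt`. If the guess about the encoding is wrong, the converted text will still look wrong, which is itself a useful check.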

Jussi Pakkanen (jpakkane) said : #6

Solved.