Cuneiform reorders text columnwise - I'd like strict line-by-line processing

Asked by MaXmuc on 2009-08-05

(tried mailing list, but my message somehow didn't appear there - sorry...)

Hi there,

I'm using Cuneiform for Linux 0.6.0 to extract the German text from a fax that was sent and received digitally. I then parse the text to extract certain strings identified by keywords on the fax.

While the recognition accuracy itself is close to 100%, Cuneiform sometimes seems to group the text into oddly ordered rectangular blocks and then process the text inside each block. This scrambles the order of the text and makes post-processing and parsing much harder, so I would prefer Cuneiform to always process the page strictly line by line. I couldn't find a way to tell Cuneiform to do so.

Is there a way to do this, perhaps some hidden command-line flag? Since this badly scrambles the order of the text, I think it could be considered a bug.

Illustration of FAX:

 KEY1 : STRING NO. 1 HERE
 KEY2 : STRING NO. 2 HERE

How Cuneiform seems to process it:

|----------|-----------------------------------|
| KEY1     | STRING NO. 1 HERE                 |
| KEY2     | STRING NO. 2 HERE                 |
|----------|-----------------------------------|

Output(!):

STRING NO. 1 HERE
KEY1
KEY2
STRING NO. 2 HERE

Imagine a lot more keywords, some of them recognized inline, some of them 'ripped out' like this, and strings being potentially multiline. As you can see, it's hard to determine which strings actually refer to which keywords.

The output I would love to *always* get (most of the text is in fact already recognized like this):

KEY1 : STRING NO. 1 HERE
(STRING NO. 1 MULTILINE)
KEY2 : STRING NO. 2 HERE
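
To illustrate why strict line order matters for the post-processing, here is a minimal C++ sketch of the kind of keyword parser described above. It is not my actual parser: the function names `parse_fields` and `trim` are made up for this example, and it assumes that every keyword line contains a colon, that a line without a colon continues the previous value, and that values themselves contain no colon.

```cpp
#include <cassert>
#include <map>
#include <sstream>
#include <string>

// Strip leading and trailing spaces, tabs, and carriage returns.
static std::string trim(const std::string& s) {
    const char* ws = " \t\r";
    const std::size_t first = s.find_first_not_of(ws);
    if (first == std::string::npos) return "";
    const std::size_t last = s.find_last_not_of(ws);
    assert(first <= last);  // holds whenever a non-blank character exists
    return s.substr(first, last - first + 1);
}

// Parse "KEY : VALUE" lines (hypothetical helper, not part of Cuneiform).
// A line without a colon is appended to the value of the most recent key.
std::map<std::string, std::string> parse_fields(const std::string& text) {
    std::map<std::string, std::string> fields;
    std::string current_key;
    std::istringstream in(text);
    std::string line;
    while (std::getline(in, line)) {
        const std::string stripped = trim(line);
        if (stripped.empty()) continue;  // skip blank lines
        const std::size_t colon = stripped.find(':');
        if (colon != std::string::npos && colon > 0) {
            // New keyword line: split at the first colon.
            current_key = trim(stripped.substr(0, colon));
            fields[current_key] = trim(stripped.substr(colon + 1));
        } else if (!current_key.empty()) {
            // Continuation line: belongs to the previous keyword.
            fields[current_key] += " " + stripped;
        }
    }
    return fields;
}
```

A parser like this only works if each continuation line directly follows its keyword line. With the block-wise output shown earlier, KEY1 and KEY2 arrive without their strings, and the strings can no longer be attributed reliably.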

Who can help?

Thanks in advance,
Matt

Question information

Language: English
Status: Solved
For: Cuneiform for Linux
Assignee: No assignee
Solved by: MaXmuc
Solved: 2009-08-19
Last query: 2009-08-19
Last reply: 2009-08-18

Jussi Pakkanen (jpakkane) said : #1

Cuneiform does have an internal parameter PUMA_Bool32_OneColumn which we do not touch at the moment. Also, I have no idea what it does. You can try fiddling with that in cuneiform-cli.cpp, which does require hacking and compiling the source yourself. If you try it, do post your results here.

MaXmuc (trash-can-gmx) said : #2

It works! I added three lines after your setters (marked with + below). Cuneiform now consistently handles the page as one column. All my problems are solved now!

My C++ is not good enough to provide a proper patch. Maybe you want to introduce a parameter --onecolumn instead of hardcoding this.

I don't know how widely this column feature is used. Perhaps you could discuss making one-column mode the default and providing a parameter --recognizecolumns instead.

Looking forward to this feature in the next version!

Thank you very much!

-------------------------------

// Set the language.
PUMA_SetImportData(PUMA_Word32_Language, &langcode);
PUMA_SetImportData(PUMA_Bool32_DotMatrix, &dotmatrix);
PUMA_SetImportData(PUMA_Bool32_Fax100, &fax);

+ // Testing if singleColumn works...
+ Bool32 singlecolumn = TRUE;
+ PUMA_SetImportData(PUMA_Bool32_OneColumn, &singlecolumn);
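
For what it's worth, the --onecolumn idea could perhaps look roughly like the sketch below. This is untested against the real cuneiform-cli.cpp: the flag name is only the one proposed above, the helper `wants_one_column` is made up, and I am assuming a FALSE constant exists alongside TRUE. I also don't know how the existing options are parsed there, so it would have to be adapted.

```cpp
#include <cassert>
#include <cstring>

// Return true if the proposed (hypothetical) "--onecolumn" flag appears
// among the command-line arguments. argv[0] is the program name and is
// skipped.
bool wants_one_column(int argc, const char* const* argv) {
    assert(argc <= 0 || argv != nullptr);
    for (int i = 1; i < argc; ++i) {
        if (std::strcmp(argv[i], "--onecolumn") == 0) {
            return true;
        }
    }
    return false;
}

// In cuneiform-cli.cpp the result could then replace the hardcoded value
// from the snippet above (FALSE is assumed to exist next to TRUE):
//
//   Bool32 singlecolumn = wants_one_column(argc, argv) ? TRUE : FALSE;
//   PUMA_SetImportData(PUMA_Bool32_OneColumn, &singlecolumn);
```

Without the flag, the behaviour would stay as it is today, so nobody relying on column detection would be affected.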