Cuneiform PDF recognition

Asked by Sergey Torokhov

Is there a plan to add PDF file recognition to CuneiForm?

The problem is that the text in some PDFs (Russian text, for example) cannot be copied properly with the "text selection tools" of a PDF viewer: because of an incorrect internal codepage or internal fonts in the PDF document, the text comes out wrong after pasting it into a text editor.

At the moment, one way to get the text out of such a "problematic" PDF is to open its pages in GIMP "as images" (not as layers!) at 300 dpi or more (needed for successful recognition) and save each page as a 16-bit (for a small file size) *.bmp (or *.tiff). Otherwise a very large input file (greater than 100 MB) makes cuneiform abort with "*** buffer overflow detected ***: cuneiform terminated".
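To get a feeling for how resolution and colour depth drive the file size, here is a rough back-of-the-envelope sketch. It assumes an A4 page (210 x 297 mm) and counts only the uncompressed BMP pixel data, ignoring the small file header; the helper name bmp_pixel_bytes is made up for this example. It does not tell at which exact size cuneiform fails, only how quickly a page approaches the 100 MB mentioned above.

```shell
# Sketch only: bmp_pixel_bytes is a hypothetical helper, not part of any tool.
# Usage: bmp_pixel_bytes WIDTH_MM HEIGHT_MM DPI BITS_PER_PIXEL
# Prints the size in bytes of the uncompressed pixel data of one BMP page.
bmp_pixel_bytes() {
    # Pixels = millimetres * dpi / 25.4, rounded to the nearest integer.
    w=$(( ($1 * $3 * 10 + 127) / 254 ))
    h=$(( ($2 * $3 * 10 + 127) / 254 ))
    # Each BMP row is padded to a multiple of 4 bytes.
    row=$(( (w * $4 + 31) / 32 * 4 ))
    echo $(( row * h ))
}

bmp_pixel_bytes 210 297 300 16   # A4, 300 dpi, 16-bit: 17399680  (about 17 MB)
bmp_pixel_bytes 210 297 300 24   # A4, 300 dpi, 24-bit: 26099520  (about 26 MB)
bmp_pixel_bytes 210 297 600 24   # A4, 600 dpi, 24-bit: 104426144 (about 104 MB)
```

So under these assumptions a 16-bit page at 300 dpi stays well below the limit, while a single 24-bit page at 600 dpi is already above 100 MB.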

After that, the resulting *.bmp files can be recognized by cuneiform, producing separate text files.

So are you planning to add PDF file recognition, at least (or in some other way) as the following steps performed in the background for the user:
1. import of the .pdf into temporary .bmp (or .tiff) files with a manually set dpi resolution;
2. automatic batch recognition of each of them;
3. automatic creation of one merged output text file from the recognized images, instead of a series of files?

Thank you in advance.

Question information

Language: English
Status: Solved
For: Cuneiform for Linux
Assignee: No assignee
Solved by: Sergey Torokhov

Yury V. Zaytsev (zyv) said :
#1

1) I think that if cuneiform is compiled with ImageMagick, it should be able to open PDFs; if it cannot, this would be trivial to implement.

2) Cuneiform is unmaintained now, so unless you do it yourself or make someone do it, it will not happen.

Sergey Torokhov (torohov-s-a) said :
#2

Thanks for the quick response!

1) I have cuneiform compiled with the imagemagick USE flag, but it refuses to process a .pdf file and fails with a "PUMA_XSave failed" error
("cuneiform -l rus -f text -o zolotko_250.txt ~/test.pdf") .

2) Well, it seems you are right, and eventually I will need to learn a programming language or look for an experienced person to do it :)
It's a pity that the project is currently unmaintained.
Or it may be possible to write a bash script that opens the PDF in GIMP in the background (or converts it directly with imagemagick) and recognizes the result with cuneiform. I'll try to figure out how to do it that way, or find something similar, but only by the end of the year :(

P. S.
Also I want to note that when you save an imported .pdf from GIMP as .bmp and choose 16 bits in "Advanced Options", you must choose the "X1 R5 G5 B5" method and NEITHER "R5 G6 B5" NOR "A1 R5 G5 B5". The choice does not affect recognition of the saved bmp files, but with the other two methods it is impossible to open the files in other graphics programs such as viewers (gwenview, gqview, etc.) or in cuneiform front-ends (cuneiform-qt, yagf).

Yury V. Zaytsev (zyv) said :
#3

Yes, it's a pity that I can't be of more help, but at least I wanted to answer your question quickly, so that you don't waste your time waiting and making guesses.

Sergey Torokhov (torohov-s-a) said :
#4

Many thanks nevertheless :)

As I have very little programming practice, I looked for bash scripts for similar tasks and wrote a short one that converts PDF to BMP, recognizes the pages and merges the results into a single file (maybe it will be useful for somebody).

Note that the script picks up ALL PDF, BMP and TMP files (I use TMP for the temporary TXT files) in the current directory!!!
So place the PDF to be converted in a separate directory; the temporary files will also be placed there (and will remain after the script finishes):

# Convert every PDF in the current directory to 300 dpi BMP pages.
for file in *.pdf; do
       convert -units PixelsPerInch -density 300 "$file" "${file%.pdf}.bmp"
done
# Recognize each BMP page and write its text to a matching .tmp file.
for file2 in *.bmp; do
       cuneiform -l ruseng -f text -o "${file2%.bmp}.tmp" "$file2"
done
# Append the per-page text files to one result file.
for file3 in *.tmp; do
       cat "$file3" >> result.txt
done

I don't know how to merge RTF files; the "cat" command is not suitable for that.
In that case it is sometimes more useful to merge to HTML instead of RTF (although the merged file will contain many repeated headers), as the result may be clearer to view. To do so, change "cuneiform -l ruseng -f text" in the script to "cuneiform -l ruseng -f html".
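A possible variation, as a sketch only: the same convert / recognize / merge steps wrapped in a function that keeps its temporary files in a private directory (so other BMP or TMP files in the current directory are not touched) and zero-pads the page numbers so that page 10 is not merged before page 2. The function name pdf_to_text and its arguments are made up for this example; it assumes ImageMagick's convert and cuneiform are on the PATH, and I have not tested it on large documents.

```shell
# Sketch only: pdf_to_text and its arguments are hypothetical names.
# Usage: pdf_to_text INPUT.pdf OUTPUT.txt [DPI]
# The body runs in a subshell "( ... )", so its variables stay local.
pdf_to_text() (
    pdf=$1
    out=$2
    dpi=${3:-300}

    # Keep page images and per-page text in a private temporary directory.
    workdir=$(mktemp -d) || exit 1

    # %04d zero-pads the page number, so the glob below lists
    # page-0002 before page-0010.
    if ! convert -units PixelsPerInch -density "$dpi" "$pdf" "$workdir/page-%04d.bmp"; then
        rm -rf "$workdir"
        exit 1
    fi

    : > "$out"    # start from an empty result file
    for page in "$workdir"/page-*.bmp; do
        txt=${page%.bmp}.txt
        # Skip pages that cuneiform fails on instead of aborting the whole run.
        cuneiform -l ruseng -f text -o "$txt" "$page" || continue
        cat "$txt" >> "$out"
    done

    rm -rf "$workdir"
)
```

It would be called, for example, as: pdf_to_text ~/test.pdf result.txt 300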

Sergey Torokhov (torohov-s-a) said :
#5

Or one can simply install the YAGF 0.8.7 front-end for CuneiForm, released on 29 August 2011, which has a new feature for importing from PDF (enabled by the "pdf" USE flag). It can already convert a PDF and recognize all of its pages :)
I just forgot to try the new version (the version installed on my system was 0.8.6, which lacks this feature).

JonyGreen (jonygreen) said :
#6

You can try this free online PDF text extractor (http://www.online-code.net/pdf-to-word.html) to extract text from a PDF.