ICR vs OCR

Asked by Andy

Hi,

I'm using queXF 1.13.5 and getting fairly poor ICR success rates, even when I re-process the same documents that I previously used to train ICR manually. I would expect a high success rate in that scenario.

From reading other posts, I understand that the ICR process is under review and bugfixes are allocated to release 1.14.0 scheduled for March 2014.

To work around this, is there a separate, more robust OCR feature that can read machine-typed field content? In other words, could I pre-fill the fields, print the forms, and read that field data reliably using queXF while the ICR problems exist?

Thanks,
Andy

Question information

Language: English
Status: Solved
For: queXF
Assignee: No assignee
Solved by: Andy

Adam Zammit (adamzammit) said :
#1

Hi Andy,

For machine typed content - I'd suggest either:

a. If the data is able to be converted to a barcode (codabar or i25), convert it and print this instead of numbers/text, and use queXF to read the barcode field. This is highly accurate.

b. Look at the code for older versions of queXF (1.11.2 and earlier) - and see how it can do "OCR" by exporting an image and calling an external program like tesseract to return the result.

Adam

Andy (andyb0070) said :
#2

Hi Adam,

I'm doing what you suggested and attempting to shell out to Tesseract to do the OCR.

There is a deprecated function in functions.ocr.php which I have reinstated, but on Windows 7 the following exec call fails:

 exec(CONVERT_BIN . " $tmpfname.wbmp -compress none -monochrome $tmpfname.tif");

I've checked the config and run both programs successfully from a Windows command prompt. I'm not seeing any errors from PHP.

Do you know what the problem might be? Is exec problematic on Windows 7?

Thanks,
Andreas.
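
When exec appears to fail without any PHP error, it can help to capture the command's exit status and its error output. The following is a minimal sketch, not queXF code; run_command is a hypothetical helper name introduced here for illustration.

```php
<?php
// Hypothetical helper (not part of queXF): run a command and capture
// what is needed to diagnose a silent failure.
function run_command(string $command): array
{
    $output = [];
    $status = -1;
    // "2>&1" merges stderr into stdout so error text is captured too;
    // the redirect is understood by both cmd.exe and POSIX shells.
    exec($command . ' 2>&1', $output, $status);
    return ['status' => $status, 'output' => implode("\n", $output)];
}

// Usage: a non-zero status usually means the shell could not find or
// run the program, and the captured output normally says why.
$result = run_command('echo hello');
if ($result['status'] !== 0) {
    error_log("Command failed ({$result['status']}): {$result['output']}");
}
```

Passing the full CONVERT_BIN command line to a helper like this should surface messages such as a "not recognized" error from the Windows shell, which exec on its own discards.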

Andy (andyb0070) said :
#3

I've resolved my issue.

On Windows, the config settings have to include the full path, enclosed in double quotes. This works:

//Old OCR Stuff
if (!defined('CONVERT_BIN')) define('CONVERT_BIN', '"C:\\Program Files\\ImageMagick-6.8.7-Q16\\convert.exe"');
if (!defined('TESSERACT_BIN')) define('TESSERACT_BIN', '"C:\\Program Files\\Tesseract-OCR\\tesseract.exe"');
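
The quotes matter because the path contains a space ("Program Files"), which the shell would otherwise treat as an argument separator. A minimal sketch of building and checking such a value follows; quoted_bin is a hypothetical helper, not something queXF provides, and the path is only an example.

```php
<?php
// Hypothetical helper: wrap an executable path in double quotes so that
// a path containing spaces reaches the shell as a single token.
function quoted_bin(string $path): string
{
    // Strip any existing surrounding quotes first to avoid doubling them.
    return '"' . trim($path, '"') . '"';
}

// Example path only; adjust to the local install.
$convert = quoted_bin('C:\\Program Files\\ImageMagick-6.8.7-Q16\\convert.exe');

// is_file() needs the unquoted path, so strip the quotes before checking.
if (!is_file(trim($convert, '"'))) {
    error_log("Executable not found: $convert");
}
```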

Furthermore, in the tesseractor function in functions.ocr.php, the exec statements look like this. The tesseract call requires the page segmentation mode parameter (-psm 10) so that the image is treated as a single character:

//call ImageMagick
exec(CONVERT_BIN . " $tmpfname.wbmp -compress none -monochrome $tmpfname.tif");

//call tesseract
exec(TESSERACT_BIN . " $tmpfname.tif $tmpfname -psm 10");
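
For readers adapting this, the two calls can be combined with a status check and a read of the result file. This is a sketch under assumptions, not the queXF implementation: build_ocr_commands and ocr_single_char are hypothetical names, the binaries are assumed to be passed in already quoted, and $tmpfname is assumed to contain no spaces. Later Tesseract versions spell the option --psm, so check tesseract --help for the installed version.

```php
<?php
// Hypothetical helper: build the two commands shown in the post above.
// $convertBin and $tesseractBin are expected to be already quoted.
function build_ocr_commands(string $convertBin, string $tesseractBin, string $tmpfname): array
{
    return [
        // ImageMagick: WBMP to uncompressed monochrome TIFF
        $convertBin . " $tmpfname.wbmp -compress none -monochrome $tmpfname.tif",
        // Tesseract: page segmentation mode 10 = single character
        $tesseractBin . " $tmpfname.tif $tmpfname -psm 10",
    ];
}

// Hypothetical helper: run both steps and return the recognised text,
// or null if either step fails or no output file appears.
function ocr_single_char(string $convertBin, string $tesseractBin, string $tmpfname): ?string
{
    foreach (build_ocr_commands($convertBin, $tesseractBin, $tmpfname) as $cmd) {
        $output = [];
        $status = -1;
        exec($cmd, $output, $status);
        if ($status !== 0) {
            return null;
        }
    }
    // Tesseract appends ".txt" to the output base name it is given.
    $resultFile = "$tmpfname.txt";
    return is_file($resultFile) ? trim(file_get_contents($resultFile)) : null;
}
```

A call such as ocr_single_char(CONVERT_BIN, TESSERACT_BIN, $tmpfname) would then return the recognised character, or null when something went wrong.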