How best to apply additional ICR training?

Asked by Wayne Lewis

Firstly, thanks for making queXF open source. It's an interesting project and as far as I can see the only one in the FOSS world that tackles this kind of forms processing.

We're running into a bit of ICR confusion, however. After the initial ICR training and a period of normal operation, what is the best way to add further training to the system? This might be required, for example, because the form-filling team has a new member whose hand-printing style the system is of course unfamiliar with.

Before verification, on each character's manual training page queXF lists the images that it recognized and mis-recognized as that character. After verification, the previously correctly recognized images are listed twice each, which seems odd, and the mis-recognized images remain. Is this expected behaviour? If so it seems to imply that additional training must be done manually, character by character, rejecting the mis-recognized images for each character. Is this correct?

Another thing we've noticed is that rejecting an image doesn't remove that image from the character's list of images -- it's there permanently. Is there any way of removing it or, better yet, re-assigning it to its proper character?

Thanks.

Question information

Language: English
Status: Answered
For: queXF

Adam Zammit (adamzammit) said :
#1

Dear wkl,

I am not sure why previously correctly recognised images are being listed twice. It is expected that mis-recognised images remain as queXF doesn't check for a match against the original ICR result.

The idea of the training process is that once a form has been verified by an operator, each image should match the character the verification operator entered. That allows the training process to proceed more or less automatically.
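
As a rough illustration of that idea (written in Python rather than queXF's PHP, and not taken from queXF's code), each character box could keep both the ICR guess and the value the verification operator entered, with training using only the operator's value as the label. The `Box` record and its field names are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Box:
    """One character box on a scanned form (hypothetical structure)."""
    image_id: int
    icr_guess: Optional[str]       # what ICR proposed; None if no KB was assigned
    verified_value: Optional[str]  # what the operator entered; None if unverified


def training_pairs(boxes: List[Box]) -> List[Tuple[int, str]]:
    """Pair each image with the operator's value, ignoring the ICR guess."""
    return [
        (b.image_id, b.verified_value)
        for b in boxes
        if b.verified_value is not None
    ]


boxes = [
    Box(1, None, "A"),  # first form: no ICR yet, operator typed "A"
    Box(2, "8", "B"),   # ICR misread the image as "8", operator corrected it
    Box(3, "C", None),  # not verified yet, so not usable for training
]
print(training_pairs(boxes))
```

Under this assumption a mis-recognised image still ends up under the right character, because the label comes from verification and not from the original ICR result.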

Currently, manual training should allow you to enter a new character if an image has been mis-classified by a verifier. I have just checked this, however, and because of the "click to disable" JavaScript function on the manual training page, it does not allow you to enter a new character. This is a bug and I will file it shortly.

Regarding rejecting an image: this is a good idea too. A rejected image should be marked so that it won't appear again for the purposes of ICR training. I will file an additional bug for this.
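
One way the requested behaviour could look, sketched in Python purely for illustration: a training sample carries a `rejected` flag that hides it from a character's list, and a separate operation moves it to the proper character. The `Sample` record and the `reject`, `reassign` and `images_for` helpers are hypothetical names, not part of queXF.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Sample:
    """One training image and the character it is filed under (hypothetical)."""
    image_id: int
    label: str
    rejected: bool = False  # True once an operator has rejected the image


def reject(samples: Dict[int, Sample], image_id: int) -> None:
    """Mark an image so it no longer appears in any character's list."""
    samples[image_id].rejected = True


def reassign(samples: Dict[int, Sample], image_id: int, new_label: str) -> None:
    """Move an image to its proper character and clear any rejection."""
    samples[image_id].label = new_label
    samples[image_id].rejected = False


def images_for(samples: Dict[int, Sample], label: str) -> List[int]:
    """List the non-rejected images filed under a character."""
    return [s.image_id for s in samples.values()
            if s.label == label and not s.rejected]


samples = {i: Sample(i, "O") for i in (10, 11, 12)}
reject(samples, 11)         # unreadable scribble: drop it from training
reassign(samples, 12, "0")  # actually a zero, not the letter O
print(images_for(samples, "O"), images_for(samples, "0"))
```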

Regards,
Adam Zammit

Wayne Lewis (wayne-lewis) said :
#2

Adam,

Thanks for the reply. I don't understand this part:

"It is expected that mis-recognised images remain as queXF doesn't check for a match against the original ICR result."

Perhaps it would help to describe a test procedure that illustrates my problem (it's a little long-winded -- sorry):

* Start with a fresh database. This has 26 tables, all empty except for 7 records in the 'boxgrouptypes' table.

* Import a new questionnaire from a PDF file along with its banding file.

* Add an operator and assign him/her to the questionnaire.

* Load a scanned form by selecting 'Import a directory of PDF files' and pressing the 'Process directory' button.

* Verify the scanned form and press 'Submit completed form to database'.

* Select 'Train ICR', then the form's name, then 'Continue training'. Selecting any of the 'Manually train' links will show us no discrepancies between the character and its images (assuming verification was done correctly).

* Press 'Start training process in background' to initiate automatic training. It takes a short while to run.

* Assign the resulting ICR KB to the questionnaire.

* Load a second scanned form by selecting 'Import a directory of PDF files' and pressing the 'Process directory' button. It takes a bit longer this time because ICR is happening.

* Verify the scanned form (recognition is poor because for this test we have used only one scanned form to train the ICR) and press 'Submit completed form to database'.

* Select 'Train ICR', then the form's name, then 'Continue training'.

Now this is the bit I don't understand. For some characters, selecting 'Manually train' will show a list of images that do match the character; for other characters, selecting 'Manually train' will show a list of images, some of which match and some of which don't match the character. This shouldn't happen after correct verification, should it? Also, in both cases, among the images that do match the character there are often duplicates.

This makes automatic training impossible, of course. Are you able to reproduce this behaviour? If not, do you have any ideas what might cause this to happen -- perhaps there's something amiss with my queXF installation? Am I going about adding extra training in the wrong way?

By the way, probably not related but I'm getting a lot of these messages in my web server error log:

[Fri Oct 31 11:25:57 2014] [error] [client xx.xx.xx.xx] FastCGI: server "/usr/lib/cgi-bin/php5-fcgi" stderr: PHP message: PHP Notice: Undefined offset: 10 in /var/www/quexf/functions/functions.ocr.php on line 372, referer: http://myserver/quexf/admin/import.directory.php

Thanks.

-- Wayne Lewis

Adam Zammit (adamzammit) said :
#3

Dear Wayne,

Thank you for your thorough description. You have identified another bug: the manual training system is including the results of automatically verified fields before they have been corrected. I will file another bug and link it to this answer; it should be an easy fix.
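
To make the intended behaviour concrete, here is a minimal Python sketch, not queXF's actual code or schema. It assumes each stored value records whether it came from automatic ICR verification or from an operator, keeps only operator-confirmed values, and keeps a single entry per image. The row layout and the `source` values are assumptions.

```python
from typing import Dict, List, Tuple

# (image_id, value, source) where source is "icr" or "operator" (assumed layout)
Row = Tuple[int, str, str]


def rows_for_training(rows: List[Row]) -> Dict[int, str]:
    """Keep one operator-confirmed label per image; skip uncorrected ICR values."""
    labels: Dict[int, str] = {}
    for image_id, value, source in rows:
        if source == "operator":
            labels[image_id] = value  # a later operator value overrides an earlier one
    return labels


rows = [
    (7, "1", "icr"),       # automatic result, wrong
    (7, "I", "operator"),  # operator's correction
    (8, "I", "icr"),       # automatic result, right
    (8, "I", "operator"),  # operator confirmed it
    (9, "I", "icr"),       # never reviewed by an operator
]
print(rows_for_training(rows))
```

Keying the result on the image also means an image can be listed only once per character.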

Regards,
Adam Zammit
