Text OCR related question

Asked by Subhash

Hi,

I am using Sikuli for scraping text from the screen.
I am on version 1.0 rc3.

I am having good success so far in terms of recognition itself.

However, my use case is to identify text that ends with a colon. This will let me tag text with greater accuracy, since I need to find the text on either side of the colon (:).

I am using Region.text() to get all text tokens from a screen region.

However, I see that the OCR process specifically filters the colon out of the image.
I can see this from the intermediate files created (xxx-lineblobs.vlog.png has the colon, and the subsequent processed output xxx-lineblobs-filtered.vlog.png has it removed).

This is critical for my processing step.
Is there any way I can configure the OCR so that the colon is treated as just another character and retained in the results?

Regards
Subhash

Question information

Language: English
Status: Answered
For: SikuliX
Assignee: No assignee
RaiMan (raimund-hocke) said :
#1

Sorry, no way.

You would have to adapt the C++ source code and then build it from the sources.

Another option is to use Tesseract from the command line:
- create an image file from the respective region using capture
- run the Tesseract command from inside the script using e.g. subprocess or Java's Runtime.exec
- get the output of the Tesseract command (e.g. stdout)
With this you can use any option that Tesseract 3 offers.
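
The steps above could be sketched roughly as follows. This is only a minimal sketch, assuming the `tesseract` executable is installed and on the PATH. The helper names (`build_tesseract_command`, `ocr_image`, `split_labels`) are made up for illustration and are not part of Sikuli or Tesseract. Instead of reading stdout, the sketch lets Tesseract write its usual `<outputbase>.txt` file and reads that back.

```python
import os
import subprocess
import tempfile


def build_tesseract_command(image_path, output_base, lang="eng"):
    # Basic Tesseract CLI form: tesseract <image> <outputbase> -l <lang>
    # (hypothetical helper; add further Tesseract options here if needed)
    return ["tesseract", image_path, output_base, "-l", lang]


def ocr_image(image_path, lang="eng"):
    # Tesseract appends ".txt" to the output base it is given.
    output_base = os.path.join(tempfile.mkdtemp(), "ocr_result")
    subprocess.check_call(build_tesseract_command(image_path, output_base, lang))
    with open(output_base + ".txt") as handle:
        return handle.read()


def split_labels(ocr_text):
    # Return (label, value) pairs for every line that contains a colon,
    # splitting only at the first colon on the line.
    pairs = []
    for line in ocr_text.splitlines():
        if ":" in line:
            label, value = line.split(":", 1)
            pairs.append((label.strip(), value.strip()))
    return pairs
```

Inside a Sikuli script this might be used as `split_labels(ocr_image(capture(some_region)))`, assuming capture returns the path of the saved image file; `some_region` is a placeholder for your own region.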

Greater improvements to the SikuliX OCR feature will only be included in version 2, starting sometime in 2016.

Subhash (subhash-bylaiah) said :
#2

Thanks much, Raimund
