[1.0] [HowTo] turn on text recognition --- solution

Asked by Abhishek Lal

--- solution

Text search with find("some text") and OCR with Region.text() are currently switched off by default because of the many known issues.

When using the Sikuli API (i.e. outside the IDE), set the related Settings options to true:

Settings.OcrTextRead = true; // to switch on the Region.text() function

Settings.OcrTextSearch = true; // to switch on finding text with find("some text")

When running scripts from inside the IDE, you can alternatively select the respective options in Preferences -> …more options.

Improvements for the text features are planned with version 1.1.

-------------------------------------------------------------------

I just started with sikuli,

I am using java + sikuli api,
I am trying to get text from a region.

but what I am getting is

[error] Region.text: text recognition is currently switched off
--- no text ---

the code I am using is

Region r_TCA0_full = r_TCA0_head.below(532);
String txt = r_TCA0_full.text();
System.out.println(txt);

How to turn on the text recognition?

Packages I already installed:
libtesseract-dev 3.02.01-2
libtesseract3 3.02.01-2
tesseract-ocr 3.02.01-2
tesseract-ocr-eng 3.02.01-2
tesseract-ocr-osd 3.02.01-2

I am on Linux (Ubuntu 12.04), using Sikuli-API-1.0.0-Lnx32.zip

*Actually I need to get the list of items in a list box

Question information

Language: English
Status: Solved
For: SikuliX
Assignee: No assignee
Solved by: RaiMan
Best RaiMan (raimund-hocke) said :
#1

see in question above

Abhishek Lal (abhisheklalnediya) said :
#2

Thanks RaiMan, that solved my question.

syed (emailsyed245) said :
#3

I am new to Sikuli and am doing R&D on the OCR feature specifically. Please help me with the following.
I have switched the OCR functionality on by using the code suggested above in Eclipse/Java
   Settings.OcrTextRead=true;
   Settings.OcrTextSearch=true;
and tried to extract the text from a text field (which is not editable), but it says --- no text ---

I cannot even use the following workaround for extracting the text:
scr.click("text_field.png",0)
scr.type("a", KEY_CTRL)
scr.type("c", KEY_CTRL)
System.out.println(Env.getClipboard())

I would be much obliged for a quick response.
Environment: Windows 7, IDE: Sikuli 1.0.1,
Jar files: sikuli-api-1.0.2-standalone.jar; sikuli-ide.jar

RaiMan (raimund-hocke) said :
#4

It seems, that you did not select option 5 at setup (want to use Tesseract ....)

- delete everything except sikuli-setup.jar from the SikuliX setup folder
- repeat setup with options 1 and 5
- should succeed
- delete the libs folder in SikuliX setup folder
- start the SikuliX IDE (the libs folder will now be recreated including the tessdata folder)

Region.text() should now return something, but it might not be what you expect ;-) (the SikuliX OCR feature is not mature yet, see related bugs)

RaiMan (raimund-hocke) said :
#5

comment #4 is for syed's question from comment #3

syed (emailsyed245) said :
#6

Thanks for your quick response RaiMan :)
I have deleted everything from the folder except sikuli-setup.jar (as you mentioned above), re-ran the setup with options 1 & 5 and turned on the settings, but I still get the same result: --- no text ---

here is my simple code .....
In Eclipse/Java
     Screen scr = new Screen();
     scr.find("images/imagesCommon/NumberOfDocsLabel.png").right(150).highlight(3);
     String txt = scr.find("images/imagesCommon/NumberOfDocsLabel.png").right(150).text();
     System.out.println("Text :" + txt); //<- It gives --- no text ---

In Sikuli IDE
     find("NumberOfDocsLabel.png").right(25).right(90).highlight(3) #<- For highlighting box
     popup(find("NumberOfDocsLabel.png").right(25).right(90).text()) #<- It gives --- no text ---

Note: I emailed you the screenshot as well, with the bug#. [as I can't see any link to upload a screenshot here :(]
Any further advice in this regard would be much appreciated.
P.S.: My Windows 7 is 64-bit.

syed (emailsyed245) said :
#7

And if OCR does not work properly in Sikuli, could you please suggest a simple API to use in Java to OCR the images alongside the Sikuli Java code?
That way I can extract the text by OCRing the images (captured through Sikuli) and compare it with the expected values.
Many thanks!

RaiMan (raimund-hocke) said :
#8

at comment #6:
I got your mail and made my tests on your shot.
I did not really have problems reading the texts on the shot, but the current version has problems with number-only text like in your right column. And this kind of "one-pixel" font is a real challenge for Tesseract in general.

This is the most important thing with Sikuli's Region.text():
the image used for OCR should only contain one line of text without any surrounding pixel clutter.
Your image contains the upper border of the text field.

To simply make some OCR tests on the screen, I use this one-liner in a separate tab in the IDE:
print "]" + selectRegion().text() + "["

This will allow you to interactively select some text on the screen and instantly see the result.
In a script, you have to add some adjustments to the region (have a look at the function Region.grow() in Java) to avoid pixel clutter.
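
The trimming itself is simple arithmetic. As a sketch only, here it is with the JDK's java.awt.Rectangle standing in for a Sikuli Region. The class name TextBox and the pixel values are made up for illustration, and how Region.grow() in your SikuliX version treats negative values is something to verify before relying on it.

```java
import java.awt.Rectangle;

// Hypothetical helper: java.awt.Rectangle stands in for a Sikuli Region here.
final class TextBox {
    /** Returns a copy of r with dx pixels trimmed left and right, dy top and bottom. */
    static Rectangle trim(Rectangle r, int dx, int dy) {
        Rectangle out = new Rectangle(r);
        out.grow(-dx, -dy); // negative values shrink the rectangle
        return out;
    }

    public static void main(String[] args) {
        // Assumed example: a text field whose border should be cut away.
        Rectangle field = new Rectangle(100, 50, 150, 24);
        System.out.println(trim(field, 2, 3));
    }
}
```

The idea is to cut away a few pixels on each side so that borders of the text field do not end up in the image handed to OCR.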

If the OCR quality is not sufficient for you, there is currently no way for you to improve it.
If you find patterns in the misreadings, you might try to compensate for them by scripting.
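
Compensating by scripting could, for example, be a small post-processing step on the returned text. The following is only an illustration, assuming the field is known to contain digits only. The class name OcrDigitFix and the substitution table are hypothetical and would have to be adapted to the misreadings you actually observe.

```java
// Hypothetical helper: not part of Sikuli or Tesseract.
final class OcrDigitFix {
    // Assumed look-alike substitutions for a digits-only field.
    private static char fix(char c) {
        switch (c) {
            case 'O': case 'o': return '0';
            case 'l': case 'I': return '1';
            case 'S': return '5';
            case 'B': return '8';
            default: return c;
        }
    }

    /** Maps look-alike letters to digits and drops whitespace. */
    static String normalize(String ocrText) {
        StringBuilder sb = new StringBuilder();
        for (char c : ocrText.toCharArray()) {
            if (Character.isWhitespace(c)) {
                continue;
            }
            sb.append(fix(c));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(normalize("l2O 5")); // prints 1205
    }
}
```

Such a table is only safe for fields where letters cannot legitimately occur.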

RaiMan (raimund-hocke) said :
#9

@comment #7
There are surely some possibilities to get some good-quality OCR.

The nearest one:
If you want to try to do OCR yourself from within a Java program, you should have a look at Tess4J (a Tesseract wrapper using JNA) http://tess4j.sourceforge.net. I am planning to use this in the next SikuliX version 1.2 instead of the native stuff.
It seems to be quite up to date (09/2013, based on Tesseract 3.02).

Currently I cannot tell you how to feed your images captured with Sikuli to Tess4J. But if you decide to try this way, I will surely help you to get on the road.

In this case, you should start to communicate with me directly using the mail button at sikulix.com top right with reference to this question.

syed (emailsyed245) said :
#10

Many Thanks RaiMan again for your useful input.
Eventually I managed to get the Sikuli OCR working. What Sikuli wasn't installing was the 'tessdata' folder with the English language file in it. It was showing the following kind of error in the command line window on my 64-bit Windows (not on my 32-bit laptop, although it was not working there either).
------Message on Command line Window-----------
Error opening data file F:/sikuli/libs/tessdata/eng.traineddata
Please make sure the TESSDATA_PREFIX environment variable is set to the parent directory of your "tessdata" directory.
-------------End---------------
So what I did: I downloaded tess4j from the following link, unzipped it, took the tessdata folder out of it, put it into the libs folder inside Sikuli and set the environment variable TESSDATA_PREFIX to the Sikuli libs folder.
http://sourceforge.net/projects/tess4j/

And it was working fine after that on both systems i.e at my work(64-bit windows7) and at home(32-bit windows7).
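
The condition that the error message complains about can be checked before starting OCR. A minimal sketch, assuming the layout the message describes (a tessdata folder directly below the directory named by TESSDATA_PREFIX). The class TessdataCheck is hypothetical and not part of Sikuli or Tesseract.

```java
import java.io.File;

// Hypothetical helper: not part of Sikuli or Tesseract.
final class TessdataCheck {
    /** Location where the language file is expected below the prefix directory. */
    static File trainedDataFile(String prefix, String lang) {
        return new File(new File(prefix, "tessdata"), lang + ".traineddata");
    }

    /** True if the language file exists below the given prefix directory. */
    static boolean isInstalled(String prefix, String lang) {
        return prefix != null && trainedDataFile(prefix, lang).isFile();
    }

    public static void main(String[] args) {
        String prefix = System.getenv("TESSDATA_PREFIX");
        if (isInstalled(prefix, "eng")) {
            System.out.println("eng.traineddata found");
        } else {
            System.out.println("eng.traineddata missing below TESSDATA_PREFIX=" + prefix);
        }
    }
}
```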
It does OCR now, but it sometimes messes up digits, which matter most in testing :(. [emailed you screenshot]

I have tried another OCR Java API, AspriseOCR.jar, but it is worse than Sikuli's built-in OCR.
Then I tried Tess4J but couldn't get it working in my Eclipse, because my system is 64-bit and Tess4J ships 32-bit DLL files. :(
If you don't mind and have time, could you please tell me how to set up Tess4J on a 64-bit system? It shows me the following error:
--------Error Message----------
Exception in thread "main" java.lang.UnsatisfiedLinkError: Unable to load library 'libtesseract302': Native library (win32-x86-64/libtesseract302.dll) not found in resource path
------End-------------
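
The win32-x86-64 prefix in this message is the resource folder JNA uses on 64-bit Windows, which suggests that a 64-bit JVM is looking for 64-bit DLLs. A small sketch to see which kind of JVM is running; the class JvmBitness and its simple string test are illustrative assumptions, not part of JNA or Tess4J.

```java
// Hypothetical helper: not part of JNA or Tess4J.
final class JvmBitness {
    /** Best-effort guess whether an os.arch value names a 64-bit JVM. */
    static boolean is64Bit(String osArch) {
        return osArch != null && osArch.contains("64");
    }

    public static void main(String[] args) {
        String arch = System.getProperty("os.arch");
        System.out.println("os.arch=" + arch + " -> needs "
                + (is64Bit(arch) ? "64-bit" : "32-bit") + " native libraries");
    }
}
```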
There are already two DLL files (liblept168.dll and libtesseract302.dll), which I copied to the c:\windows\system32 folder and also put into a folder (tesseractlib) in my project folder. I already have the following jar files on my build path:
ghost4j-0.3.1.jar
jna.jar
jai_imageio.jar
junit-4.10.jar
tess4.jar
and in my code i am also trying to load them by doing this
---------------code---------
public static void main(String[] args) {
  System.setProperty("jna.library.path", "//tesseractlib//");
  File file = new File(System.getProperty("user.dir") + "//tempOCRImages//no1DropDown.png");
  Tesseract t4j = Tesseract.getInstance();
  try {
    String text = t4j.doOCR(file);
    System.out.println("Text is :" + text);
  } catch (Exception e) {
    System.out.println(" IN the exception : " + e);
  }
}
---------End code-----------
Sorry for posting off-topic here, but it may help anyone else who really needs to OCR images for testing purposes.
Thanks

RaiMan (raimund-hocke) said :
#11

-- ERROR: Error opening data file F:/sikuli/libs/tessdata/eng.traineddata
Sorry that this cost you so much effort.
If you ran setup with option 5 and get this message, simply delete the libs folder and restart the IDE: the libs folder will then be recreated, this time including tessdata (fixed with version 1.0.1).

-- Tess4J
I downloaded it to my Mac and played around a little bit.
It looks very promising and I will integrate it with SikuliX, since then everything can be handled on the Java level.
… but with your numbers Tesseract has problems anyway (made some tests)

So I guess for the moment, using Sikuli's OCR is the easiest choice.

The stony way:
Tess4J: Besides the Windows 64-bit challenge (you might switch to a 32-bit Java), there is a problem if you want to combine Tess4J with SikuliX: normal operation expects 300 dpi images, whereas the images Sikuli takes from the screen are 72 dpi to 96 dpi. So they have to be scaled up and converted to grayscale (which Sikuli currently does internally before giving the image to Tesseract).
Both can be done in Java with image processing using the graphics context.
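
A minimal sketch of those two steps with the JDK's own imaging classes. The class name OcrPreprocess, the factor 3 (roughly 96 dpi to 300 dpi) and the bicubic interpolation are assumptions for illustration; this is not what Sikuli does internally.

```java
import java.awt.Graphics2D;
import java.awt.RenderingHints;
import java.awt.image.BufferedImage;

// Hypothetical helper: not part of Sikuli or Tess4J.
final class OcrPreprocess {
    /** Returns a grayscale copy of src, enlarged by the given factor. */
    static BufferedImage toGrayScaled(BufferedImage src, double factor) {
        int w = (int) Math.round(src.getWidth() * factor);
        int h = (int) Math.round(src.getHeight() * factor);
        // Drawing into a TYPE_BYTE_GRAY image converts the colors to gray.
        BufferedImage out = new BufferedImage(w, h, BufferedImage.TYPE_BYTE_GRAY);
        Graphics2D g = out.createGraphics();
        try {
            g.setRenderingHint(RenderingHints.KEY_INTERPOLATION,
                    RenderingHints.VALUE_INTERPOLATION_BICUBIC);
            g.drawImage(src, 0, 0, w, h, null); // scales while drawing
        } finally {
            g.dispose();
        }
        return out;
    }

    public static void main(String[] args) {
        // Stand-in for a screen capture; a real one would come from Sikuli.
        BufferedImage shot = new BufferedImage(100, 20, BufferedImage.TYPE_INT_RGB);
        BufferedImage big = toGrayScaled(shot, 3.0);
        System.out.println(big.getWidth() + "x" + big.getHeight()); // prints 300x60
    }
}
```

If your Tess4J version offers a doOCR overload that takes a BufferedImage, the result could be passed to it directly; otherwise it can be written to a file with ImageIO first.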

RaiMan (raimund-hocke) said :
#12

--- Sorry for posting off-topic here, but it may help anyone else who really needs to OCR images for testing purposes.

That really is no problem in any case. Working alternatives that might be combined with Sikuli are always welcome.

Pranav (pranav-avhad2009) said :
#13

Hello RaiMan,

I am working on UI testing of a Windows-based application. I want to extract all text (including label text and text fields) from the application.
I tried to get it using the code below. It retrieves data, but some of the data is incorrect.

Code used:
String text = TextRecognizer.getInstance().recognize(ImageIO.read(new File(path)));

You also suggested that converting the image to grayscale improves text recognition.

How do I do that?

RaiMan (raimund-hocke) said :
#14

@Pranav
To extract text from an image given as a file, you should use the Tesseract package itself from the command line.

The OCR feature of SikuliX still is not reliable in such cases.

Another option is Tess4J, that lets you do the job in Java.
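
The command-line route can also be driven from Java. A rough sketch, assuming a tesseract executable is on the PATH and using the basic form "tesseract <image> <outputbase> -l <lang>", which writes the recognized text to <outputbase>.txt. The class TesseractCli and the file names are hypothetical.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper: assumes a `tesseract` executable on the PATH.
final class TesseractCli {
    /** Builds: tesseract <image> <outputBase> -l <lang> */
    static List<String> command(String image, String outputBase, String lang) {
        return Arrays.asList("tesseract", image, outputBase, "-l", lang);
    }

    /** Runs Tesseract and returns the text it wrote to <outputBase>.txt. */
    static String ocr(String image, String outputBase, String lang)
            throws IOException, InterruptedException {
        Process p = new ProcessBuilder(command(image, outputBase, lang))
                .inheritIO().start();
        if (p.waitFor() != 0) {
            throw new IOException("tesseract exited with an error");
        }
        byte[] bytes = Files.readAllBytes(Paths.get(outputBase + ".txt"));
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws Exception {
        if (args.length == 0) {
            // Without an image argument, only show the command that would run.
            System.out.println(String.join(" ", command("image.png", "out", "eng")));
            return;
        }
        System.out.println(ocr(args[0], "out", "eng"));
    }
}
```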

Pranav (pranav-avhad2009) said :
#15

Hello RaiMan,

Thank you for the reply.

I am able to get the text using Tess4J, but the extracted data is not 100% accurate (also, some data is missing).

Is there any other way to get all the data accurately?

I use the code below to extract text with Tess4J:
-----------------------------------------------------------------------------------------------------------
File imageFile = new File("MainScreen.png");

// Tesseract instance = Tesseract.getInstance(); // JNA Interface Mapping
Tesseract1 instance = new Tesseract1(); // JNA Direct Mapping
instance.setDatapath("F:\\TEST AUTOMATION\\LDTP_SIKULI\\SIKULI_LDTP_STUFF\\Tess4J-2.0-src\\Tess4J");

try {
    String result = instance.doOCR(imageFile);
    //instance.doOCR(new Buff)
    System.out.println(result);
} catch (TesseractException e) {
    System.err.println(e.getMessage());
}

RaiMan (raimund-hocke) said :
#16

Sorry, I am not (yet ;-) an expert in using Tesseract.

If you want to improve your results, you have to dive into the features of Tesseract (like learning a new font) including the possibilities to pimp up the image for OCR (enlarge, grayscale as light on dark, ...)

Currently I cannot help you further.
Scan the net for Tesseract usage and look into their faqs.

JonyGreen (jonygreen) said :
#17

if you like tesseract ocr, you may like this free online ocr tool http://www.online-code.net/ocr.html using tesseract ocr 3.02

Theo Ronna (ronna88) said :
#18

Where exactly do I change these options?

TextSearch and OCR find("some text") and Region.text() are currently switched off in the default with respect to the many issues.

The related Settings options to be used (set to true) with Sikuli API (or outside the IDE):

Settings.OcrTextRead = true; // to switch on the Region.text() function

Settings.OcrTextSearch = true; // to switch on finding text with find("some text")

I program in Eclipse and still receive the message
"[error] text: text recognition is currently switched off
--- no text ---"

RaiMan (raimund-hocke) said :
#19

@Theo
supposing it is version 1.1.0:
I guess, you do not have a valid Tesseract setup.

How do you get your sikulixapi.jar referenced in your Eclipse project?

Did you run setup or is it a Maven project?

Theo Ronna (ronna88) said :
#20

@RaiMan
I did the setup with options 1, 2 and 3.
Screen s = new Screen();
String txt;
try {
  txt = s.find("imgs/teste.png").right(150).text();
  System.out.println("Text :" + txt);
} catch (FindFailed e) {
  // TODO Auto-generated catch block
  e.printStackTrace();
}

and I always get [error] text: text recognition is currently switched off
--- no text ---

Theo Ronna (ronna88) said :
#21

@RaiMan
I found my error, sorry! And thanks for the help!

Karthikk Sanku (sanku-karthik) said :
#22

@Raiman
Hi Raiman,

After trying many things, I am still not able to get text from an image using Sikuli's Tesseract support.

Below are the steps I am following:

1) I downloaded sikulixsetup1.1.0 and extracted sikulixapi.jar by selecting options 2 and 3 there. (Many in the forum said they extracted it with the 1.2 5 options or the 3, 5 options for Tesseract, but I can see only 3 options.)
2) Then I used the code below:
           Pattern p = new Pattern("C:\\Testing\\images\\image1.png");
           File imageFile = new File("C:\\Testing\\images\\image1.png");
           String txt = s.find(p).text(); // text() returns a String, not a Match
           s.find(p).highlight(10);
           System.out.println(txt);
3) But I always get 'txt' = no text
 Log is
     jul 19, 2016 12:59:18 PM org.bridj.BridJ log
   INFO: Registering type org.sikuli.util.SysJNA$WinKernel32
   jul 19, 2016 12:59:18 PM org.bridj.BridJ log
   INFO: Registering type org.bridj.TimeT
   jul 19, 2016 12:59:18 PM org.bridj.BridJ log
  INFO: Registering type org.bridj.TimeT$timeval_customizer
  jul 19, 2016 12:59:18 PM org.bridj.BridJ log
  INFO: Registering type org.bridj.StructIO$DefaultCustomizer
  jul 19, 2016 12:59:18 PM org.bridj.BridJ log
  INFO: Registering type org.bridj.TimeT$timeval
  jul 19, 2016 12:59:18 PM org.bridj.BridJ log
  INFO: Registering type org.bridj.StructObject
  jul 19, 2016 12:59:18 PM org.bridj.BridJ log
  INFO: Registering type org.bridj.NativeObject
  jul 19, 2016 12:59:18 PM org.bridj.BridJ log
  INFO: Registering type org.bridj.AbstractIntegral
  jul 19, 2016 12:59:18 PM org.bridj.BridJ log
  INFO: Registering type java.lang.Number
  [error] text: text recognition is currently switched off
  --- no text ---

Please let me know what the problem could be. Am I making a mistake?

SreeCharan Shroff (charaan) said :
#23

hi Raiman,

screen.click(screen.findText("File"))
works fine with sikulix setup installed.

Please let me know how to get it working in a Maven project without SikuliX setup.
As of now, it fails as follows:

[error] TextRecognizer: init: export tessdata not possible - run setup with option 3
Exception in thread "main" FindFailed: null
  Line 2535, in file Region.java

 at org.sikuli.script.Region.wait(Region.java:2535)
 at org.sikuli.script.Region.findText(Region.java:2640)
 at org.sikuli.script.Region.findText(Region.java:2651)
 at demoforpk.TestSikuliX.main(TestSikuliX.java:12)
[error] TextRecognizer not working: tessdata stuff not available at:
C:\Users\Charan\AppData\Roaming\Sikulix\SikulixTesseract\tessdata

Thanks,
Charan

Kogul (selvanathan4220) said :
#24

Is there any solution for this? Please suggest.

[error] text: text recognition is now switched off
--- no text ---