[1.1.0] Region.text(): Chinese text not readable --- only possible with additional Tesseract actions

Asked by snowCao

************ solution

See how to do it: FAQ 2709

-------------------------------------------------

Environment: SikuliX 1.1.0 on Java
OS: Windows 7 x64

The picture baidu.png contains Chinese text.

When I run code like this: System.out.print(find("baidu.png").text());
the result is unreadable text.

How can I solve this problem?
Thanks!

Question information

Language: English
Status: Answered
For: SikuliX
Assignee: No assignee

RaiMan (raimund-hocke) said :
#1

SikuliX uses Tesseract internally.

This approach needs the latest 1.1.0 (no chance with earlier versions).

Step 1:
If you want to use the SikuliX text feature seriously with Chinese text, you first have to install Tesseract on your system, so that you can use Tesseract standalone from the command line.
After installation, try the different language packages that Tesseract offers.
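
For step 1, a check driven from Java could look like the following minimal sketch. It assumes that the tesseract executable is on the PATH and that the language data for simplified Chinese (Tesseract language code chi_sim) is installed. The class name TesseractCli, the output base name out, and the image file name titleChinese.png are only illustrative. The sketch builds the usual command line, tesseract <image> <outputbase> -l <lang>, and runs it.

```java
import java.io.IOException;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Arrays;
import java.util.List;

public class TesseractCli {

    // Builds the command line: tesseract <image> <outputBase> -l <lang>
    public static List<String> buildCommand(Path image, String outputBase, String lang) {
        return Arrays.asList("tesseract", image.toString(), outputBase, "-l", lang);
    }

    // Runs the command, forwards its console output, and returns the exit code
    public static int run(List<String> command) throws IOException, InterruptedException {
        Process process = new ProcessBuilder(command).inheritIO().start();
        return process.waitFor();
    }

    public static void main(String[] args) throws InterruptedException {
        // Assumption: image name and language code are examples only
        List<String> command = buildCommand(Paths.get("titleChinese.png"), "out", "chi_sim");
        System.out.println(String.join(" ", command));
        try {
            System.out.println("tesseract exit code: " + run(command));
        } catch (IOException e) {
            // Typically means the tesseract executable was not found on the PATH
            System.out.println("could not start tesseract: " + e.getMessage());
        }
    }
}
```

If the call succeeds, Tesseract writes the recognized text to out.txt. For traditional Chinese the language code would be chi_tra instead.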

Step 2:
If you succeed in reading text from images well enough, you then have to copy the tessdata folder content to the SikulixTesseract folder and tell SikuliX to use this language pack for text reading.
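
The copy part of step 2 might be sketched as follows. It assumes that the language pack is a single file named <lang>.traineddata (for example chi_sim.traineddata) inside the tessdata folder of the standalone Tesseract installation. The location of that folder depends on the installation, so it is passed as an argument. The class and method names are hypothetical. The target is the %APPDATA%\Sikulix\SikulixTesseract\tessdata folder mentioned in this thread.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardCopyOption;

public class LanguagePackCopy {

    // Folder named in this thread: %APPDATA%\Sikulix\SikulixTesseract\tessdata
    public static Path sikulixTessdata(String appData) {
        return Paths.get(appData, "Sikulix", "SikulixTesseract", "tessdata");
    }

    // Assumption: a language pack is one file, e.g. chi_sim.traineddata
    public static String trainedDataFile(String lang) {
        return lang + ".traineddata";
    }

    // Copies the language file into the target folder, overwriting an older copy
    public static Path copyLanguagePack(Path systemTessdata, Path targetDir, String lang)
            throws IOException {
        Path source = systemTessdata.resolve(trainedDataFile(lang));
        Files.createDirectories(targetDir);
        return Files.copy(source, targetDir.resolve(source.getFileName()),
                StandardCopyOption.REPLACE_EXISTING);
    }

    public static void main(String[] args) throws IOException {
        String appData = System.getenv("APPDATA");
        if (args.length < 1 || appData == null) {
            System.out.println("usage (on Windows): java LanguagePackCopy <system tessdata folder> [lang]");
            return;
        }
        String lang = args.length > 1 ? args[1] : "chi_sim";
        Path copied = copyLanguagePack(Paths.get(args[0]), sikulixTessdata(appData), lang);
        System.out.println("copied to " + copied);
    }
}
```

After copying, the language code given to SikuliX through Settings.OcrLanguage would presumably have to match the file name prefix (chi_sim here). That is an assumption and is not confirmed in this thread.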

If you decide to go this way, come back with the relevant information after you have completed step 1.
I will then help you get going with step 2.

snowCao (c138255) said :
#2

Step 1 is finished.
I can read text from images when the text is English, but it does not work when the text is Chinese.

Step 2:

I downloaded 'tesseract-ocr-3.02.eng.tar.gz' into %AppData%\Sikulix\SikulixDownloads
and installed tessdata in %appdata%\Sikulix\SikulixTesseract\tessdata.

"Copy the tessdata folder content to the SikulixTesseract folder":
Is the tessdata folder the one from 'tesseract-ocr-3.02.eng.tar.gz'? Should I copy my picture to %appdata%\Sikulix\SikulixTesseract\tessdata, or something else?

"Tell SikuliX to use this language pack for text reading":
How do I tell SikuliX to use this language pack?

RaiMan (raimund-hocke) said :
#3

OK, there might have been a misunderstanding:

With SikuliX version 1.1.0+, you indeed get all you need for reading text in English, just by selecting option 3 at setup.
So there is no need to fiddle around with the stuff in %appdata%\Sikulix, since it is simply there.

Step 1 above meant that you install the complete Tesseract package on your system, so that you can use the tesseract command from the command line. You should not mix this up with the SikuliX installation. These are 2 different things.
Once you have completed step 1 (meaning you have successfully read some Chinese text from an image), come back.

snowCao (c138255) said :
#4

Sorry, my English is poor.

package test_sikuli;

import org.sikuli.script.*;
import org.sikuli.basics.Settings;
import java.io.*;

public class test {

    public static void main(String[] args) throws FindFailed, InterruptedException {
        String str = "cmd /c start firefox";
        try {
            Runtime.getRuntime().exec(str);
        } catch (IOException e) {
            e.printStackTrace();
        }
        Screen a = new Screen();
        Settings.OcrTextRead = true;
        Settings.OcrLanguage = "CHN";

        Thread.sleep(2000);
        String b = a.find("img/title.png").text();
        if (b != null)
            System.out.print(b + "\n");
        else
            System.out.print("i bbbb\n");
        String c = a.find("img/titleChinese.png").text();
        if (c != null)
            System.out.print(c + "\n");
        else
            System.out.print("i ccccc\n");
    }
}

The result is:
aboutzcehome    <- title.png, which contains only English
—I—j            <- titleChinese.png, which contains Chinese
 '\I/'

Why is the text still unreadable when it is Chinese?

RaiMan (raimund-hocke) said :
#5

I have tested it and it works.

How to: see FAQ 2709
