utf-8 file content not recognized in Sikuli script --- use provided command files or Java option -Dfile.encoding=UTF-8

Asked by obaskirt

Hello,

I saved this line in a txt file.
"Fiber / Site Yöneticilerine 24 ay Ücretsiz Fiber Internet - Generic"

ö and Ü are non-ASCII characters. When I ran the code below, I got this result:
"Fiber / Site Y�neticilerine 24 ay �cretsiz Fiber Internet - Generic"

Here is my code:
fiberSunulariLoc="C:\\Users\\Onur\\Desktop\\SIKULI_PROJs\\Config\\FiberSunuListesi.txt"
fiberLocationIDFile=open(fiberSunulariLoc,'r')
fiberLocationIDArr=list(fiberLocationIDFile.readlines())
fiberLocationIDFile.close()
print(fiberLocationIDArr[0])

I need your help to solve this problem. Thanks in advance.

Question information

Language: English
Status: Solved
For: SikuliX
Assignee: No assignee
Solved by: obaskirt

This question was reopened

obaskirt (onur-baskirt) said :
#1

I tried the code below, but it did not work:

import codecs
fiberSunulariFile=codecs.open(fiberSunulariLoc, 'r', encoding='utf-8')
fiberSunulariArr=fiberSunulariFile.readlines()
print(fiberSunulariArr[0])

I got the result below:
fiberSunulariArr=fiberSunulariFile.readlines()
 File "C:\Program Files\Sikuli X\sikuli-script.jar\Lib\codecs.py", line 626, in readlines
 File "C:\Program Files\Sikuli X\sikuli-script.jar\Lib\codecs.py", line 535, in readlines
 File "C:\Program Files\Sikuli X\sikuli-script.jar\Lib\codecs.py", line 424, in read
UnicodeDecodeError: 'utf-8' codec can't decode bytes in position 14-17: invalid data
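
A note on the traceback above: position 14 is where the ö sits in the sample line, which would be consistent with the file having been saved in a single-byte Windows code page rather than UTF-8. The thread does not confirm this, so treat it as an assumption. A minimal sketch for checking how a file's raw bytes decode under a few candidate encodings follows; the helper name try_decodings, the candidate list, and the temporary stand-in file are all made up for illustration.

```python
import os
import tempfile


def try_decodings(raw, encodings=("utf-8", "cp1254", "cp1252")):
    """Return a dict mapping each encoding name to the decoded text,
    or to None when the bytes are not valid in that encoding."""
    results = {}
    for name in encodings:
        try:
            results[name] = raw.decode(name)
        except UnicodeDecodeError:
            results[name] = None
    return results


# Stand-in for the real file: the sample line saved as Windows-1254 (Turkish).
# This encoding is an assumption, not something stated in the thread.
sample = u"Fiber / Site Y\u00f6neticilerine 24 ay \u00dccretsiz Fiber Internet - Generic"
path = os.path.join(tempfile.mkdtemp(), "sample.txt")
f = open(path, "wb")
f.write(sample.encode("cp1254"))
f.close()

# Read the raw bytes back without any decoding.
f = open(path, "rb")
raw = f.read()
f.close()

results = try_decodings(raw)
for name in sorted(results):
    if results[name] is None:
        print(name + ": not valid in this encoding")
    else:
        print(name + ": decodes cleanly")
```

If utf-8 is reported as not valid while a code page decodes cleanly, re-saving the file as UTF-8 in the editor is one possible remedy.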

obaskirt (onur-baskirt) said :
#3

Finally, I solved the problem this way:

import codecs
fiberSunulariFile=codecs.open(fiberSunulariLoc, 'r', encoding='utf-8')
fiberSunulariArr=fiberSunulariFile.readlines()
fiberSunulariArr[0]= fiberSunulariArr[0].encode( "utf-8" )
print(fiberSunulariArr[0])
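
The approach above separates two steps: decoding the file's bytes into Unicode text on read, and encoding that text back to UTF-8 bytes before output. A minimal sketch of the same round trip follows. It uses io.open instead of codecs.open, assuming a Python or Jython version that ships the io module; the helper name read_lines_utf8 and the temporary file are hypothetical stand-ins for the real config file.

```python
import io
import os
import tempfile


def read_lines_utf8(path):
    """Read a text file as UTF-8 and return its lines as Unicode strings."""
    f = io.open(path, "r", encoding="utf-8")
    try:
        return f.readlines()
    finally:
        f.close()


# Stand-in for FiberSunuListesi.txt, written explicitly as UTF-8.
sample = u"Fiber / Site Y\u00f6neticilerine 24 ay \u00dccretsiz Fiber Internet - Generic\n"
path = os.path.join(tempfile.mkdtemp(), "sample_utf8.txt")
f = io.open(path, "w", encoding="utf-8")
f.write(sample)
f.close()

lines = read_lines_utf8(path)

# Step 1: the decoded line is Unicode text; strip the trailing newline.
first = lines[0].rstrip(u"\r\n")

# Step 2: encode explicitly to UTF-8 bytes, as in the post above.
encoded = first.encode("utf-8")
```

Keeping text as Unicode inside the script and encoding only at the point of output avoids depending on whatever default encoding the runtime happens to use.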

RaiMan (raimund-hocke) said :
#5

I could not reproduce this behavior on either Windows 7 or Mac OS X 10.6 with X-1.0rc3.

If the file contains UTF-8 text, it is read correctly and printed in the message area.

*** utf8Test.sikuli
import os
dir = getBundlePath()
fn = os.path.join(dir, "utf8.txt")
f = open(fn)
print f.readlines()[0]
f.close()

*** utf8.txt in utf8Test.sikuli
Dieser Text enthält Non-Ascii Zeichen: 今天在地鐵站看到竟然有人在彈豎琴

The message area in Sikuli IDE shows as expected:
Dieser Text enthält Non-Ascii Zeichen: 今天在地鐵站看到竟然有人在彈豎琴

Michał (konopacki-m) said :
#6

Sorry for commenting on this old bug, but I think I know why this happened. I recently had the same issue on Windows 7, and the cause was that I had opened sikuli-ide.jar directly. When I started the Sikuli IDE through runIDE.bat, the problem disappeared.

RaiMan (raimund-hocke) said :
#7

@Michał
Thanks for commenting anyway.

The difference between a plain Java execution (double-clicking a runnable jar) and using one of the provided command files is the Java option -Dfile.encoding=UTF-8, which is present in the latter case and tells Java that file content is encoded in UTF-8.

In the former case, the encoding seen by Java is the one from your system settings or Java default settings, which on Windows is usually something other than UTF-8.
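
To see which default is in effect, and what a mismatch looks like, here is a minimal sketch. The helper name platform_default_encoding is hypothetical. It reads the Java system property file.encoding when run under Jython and falls back to the locale's preferred encoding elsewhere. The choice of cp1252 as the "wrong" encoding is only an example of a common Windows default.

```python
import locale


def platform_default_encoding():
    """Report the default encoding the runtime would apply.

    Under Jython this is the Java system property file.encoding;
    elsewhere the locale's preferred encoding is used instead."""
    try:
        from java.lang import System  # importable only under Jython
        return System.getProperty("file.encoding")
    except ImportError:
        return locale.getpreferredencoding()


print("default encoding: " + platform_default_encoding())

# The same UTF-8 bytes, interpreted with the right and the wrong encoding.
utf8_bytes = u"Y\u00f6neticilerine".encode("utf-8")
as_utf8 = utf8_bytes.decode("utf-8")      # the original text
as_cp1252 = utf8_bytes.decode("cp1252")   # two wrong characters replace the o-umlaut
```

When the jar is started by hand rather than through the command files, the option named above goes on the java command line before -jar.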