[1.1.4] IDE: OCR Tuning

Asked by Jan

I am not a developer and am completely new to Tesseract. I tried to understand the Tesseract documentation on GitHub, but it is not clear to me which Tesseract functionality can or should be imported or used in SikuliX (e.g. "copy only the traineddata files of your language into the SikuliX AppData folder").
Now I want to read some folder names in Windows 10 Explorer when storing the output of a web app locally and, based on the OCR result, change the folder or create a new subfolder. I assume that Windows 10 uses Segoe fonts. I have a German special character in my root folder path; the OCR result is: "Dieser PC > Lokaler Datentréger(Cz) > ...". This may also be a SikuliX issue, but I can use a workaround for it.

My Issue:
When embedded in a meaningless mixture of digits and letters, a lowercase "l" (as in "Lima") is always (100%!) recognized as a pipe symbol ( | ). In addition, in the same scenario more than 70% of uppercase "O" (as in "Oscar") are recognized as "0" (zero), and vice versa for the zero.
Enlarging the characters in Windows Explorer to 150% did not help. I assume the root cause is that different fonts are in use.

My questions:
1. How can I tell Tesseract OCR that it should try to recognize Segoe fonts?
2. Until now I have only added the German traineddata files to the Tesseract folder. Are there font sets to add as well?
3. Can I provide Tesseract OCR with a blacklist of characters, saying that there will never be a pipe symbol in the text?
4. What are the standard fonts of the Tesseract version that is embedded in SikuliX 1.1.4? My idea is to switch the standard fonts of Windows Explorer to ones that Tesseract handles well (and switch back afterwards).

Thanks a lot in advance!

Question information

Language: English
Status: Answered
For: SikuliX
Assignee: No assignee

RaiMan (raimund-hocke) said :
#1

--- the OCR result is: "Dieser PC > Lokaler Datentréger(Cz) > ..."
I guess you are using print ocrResult to get this.
The normal print statement cannot print UTF-8 strings (such as the OCR results).
Try this instead:
uprint(ocrResult)

With the OCR problems in general, I cannot help you.

You either have to find a font that works or tweak Tesseract with the training tools it offers (and finally add the resulting traineddata to the SikuliX environment). I have no experience with that, since I have never done it.
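
One workaround that needs no Tesseract tuning is to clean up the OCR text afterwards. This is only a minimal sketch, assuming (as the question states) that the real folder names never contain a pipe symbol; fix_pipes is a made-up helper name, not part of SikuliX or Tesseract:

```python
def fix_pipes(ocr_text):
    """Replace every pipe with a lowercase "l".

    Assumption taken from the question: the real text never contains "|",
    so each pipe in the OCR result is a misread lowercase "l".
    """
    return ocr_text.replace("|", "l")


# "a7|k2" is an invented OCR result, used only for illustration.
print(fix_pipes("a7|k2"))  # prints: a7lk2
```

This does not help with the O/0 confusion, since both characters can legitimately occur in the text.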

BTW: navigating Explorer trees visually with OCR is a huge effort. Python has nice features for accessing the file system directly.
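
As a rough illustration of the file system route, here is a sketch that uses only the standard os module. The helper names list_folders and ensure_subfolder are made up for this example, and it assumes the script already knows the parent folder's path:

```python
import os


def list_folders(parent):
    """Return the sorted names of the direct subfolders of parent."""
    return sorted(
        name for name in os.listdir(parent)
        if os.path.isdir(os.path.join(parent, name))
    )


def ensure_subfolder(parent, name):
    """Return the path parent/name, creating the folder if it is missing."""
    target = os.path.join(parent, name)
    if not os.path.isdir(target):
        os.makedirs(target)
    return target
```

A script could call ensure_subfolder with the output folder of the web app and the wanted subfolder name, then save into the returned path. No OCR is involved, so the l/| and O/0 confusions cannot occur.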

Mike (maestro+++) said :
#2

In Windows Explorer you can use Copy Path. Your script can then read the file path with:

pth = Env.getClipboard()

You can then manipulate the string in Python.
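
A small sketch of that string handling, using the standard ntpath module. The helper name split_windows_path and the sample path are invented for illustration, and the sketch assumes the copied path may arrive wrapped in double quotes:

```python
import ntpath


def split_windows_path(pth):
    """Split a Windows path string into its drive and a list of folder names."""
    # Assumption: the clipboard text may carry surrounding whitespace or quotes.
    pth = pth.strip().strip('"')
    drive, rest = ntpath.splitdrive(pth)
    folders = [part for part in rest.split("\\") if part]
    return drive, folders


# In a SikuliX script the argument would be the pth value read from the clipboard.
drive, folders = split_windows_path('"C:\\Users\\Jan\\Output"')
print(drive)    # prints: C:
print(folders)  # prints: ['Users', 'Jan', 'Output']
```

The last entry of folders is the name of the folder currently open in Explorer, so the script can compare it against the expected name without any OCR.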

Note that there is a keyboard shortcut: Shift+Menu, then "a".

I'll raise a request to implement the Menu key in SikuliX.

Mike (maestro+++) said :
#3

Actually, you can get to the menu dialogue using Shift+F10.
