Tesseract reads most numbers correctly but not all. How to improve?

Asked by MP on 2019-05-03

I tried all the PSM settings of Tesseract, but with every one of them it still reads the "5" as "S" and the "30" as "3 O" (letters instead of digits).

NumberCheck: 5
[error] script [ app_part7 ] stopped with error at line --unknown--
[error] Error caused by: Traceback (most recent call last):
  File "X:\Sikulix projects\app_part7.sikuli\app_part7.py", line 23, in <module>
    if int(r.text()) == 0:
ValueError: invalid literal for int() with base 10: 'S '

The weird part is that in the NumberCheck output (which I added as a check) the number is read correctly, but as soon as I want to perform an action based on the number, it seems to be read again, and this time incorrectly...

How can I change the Tesseract configuration? I read the doc, but things like tr.setdpi(50) don't work...

So my questions:

1) I don't understand the syntax for changing the different Tesseract settings.

2) What is the best way (read: the easiest way for a beginner in programming) to read those 2 numbers correctly? The numbers are really small, something like 50 to 70 dpi. I read that I should scale the image to 300 dpi. How do I do that?

3) Or is there another solution I am not seeing?
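On the scaling question in 2), the arithmetic can be sketched as follows. This is only an illustration of how large the enlarged image would have to be, not a SikuliX call: the function names scale_factor and scaled_size are made up for this example, and the 60 dpi source value is an assumption taken from the "50 to 70" estimate above.

```python
def scale_factor(source_dpi, target_dpi=300.0):
    """Return the enlargement factor that brings source_dpi up to target_dpi."""
    if source_dpi <= 0:
        raise ValueError("source_dpi must be positive")
    return target_dpi / float(source_dpi)


def scaled_size(width, height, source_dpi, target_dpi=300.0):
    """Return the (width, height) in pixels after enlarging to target_dpi."""
    factor = scale_factor(source_dpi, target_dpi)
    return int(round(width * factor)), int(round(height * factor))


# An 18 x 17 pixel capture, assumed to be at roughly 60 dpi.
print(scaled_size(18, 17, 60))  # (90, 85)
```

So a capture at about 60 dpi would need to be enlarged by a factor of 5 before OCR. How the enlargement itself is done depends on the image tools available in the script.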

Question information

Language: English
Status: Answered
For: Sikuli
Assignee: No assignee
Last query: 2019-05-07
Last reply: 2019-05-07

RaiMan (raimund-hocke) said : #1

Which SikuliX version are you using? 1.1.4?

Please paste the relevant code snippet.

MP (jozzzzzz) said : #2

1.1.4

r = Region(1059, 318, 18, 17)
regiononchangecheck = Region(1056, 316, 342, 23)

tr = TextOCR.start()
tr.setPSM(6)

# stop observing as soon as the watched region changes
def eventstop(event):
    event.stopObserver()

regiononchangecheck.onChange(eventstop)
regiononchangecheck.observe(1000000)

print("NumberCheck: " + r.text())

if int(r.text()) == 5:
    print("0")
    # action1 (click, etc.)
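
One possible workaround for the ValueError, sketched in plain Python: read the OCR text once, map the letters that are typically confused with digits back to digits, and only then convert to int. The helper name parse_ocr_number and the LOOKALIKES table are assumptions made up for this sketch, and the table is certainly incomplete.

```python
# Letters that OCR commonly confuses with digits (assumed, incomplete list).
LOOKALIKES = {"S": "5", "s": "5", "O": "0", "o": "0", "l": "1", "I": "1", "B": "8"}


def parse_ocr_number(raw):
    """Convert raw OCR text to an int, or return None if no digits remain."""
    # Replace look-alike letters, then keep only the digit characters.
    cleaned = "".join(LOOKALIKES.get(ch, ch) for ch in raw.strip())
    digits = "".join(ch for ch in cleaned if ch.isdigit())
    if not digits:
        return None
    return int(digits)


print(parse_ocr_number("S "))  # 5
print(parse_ocr_number("3O"))  # 30
```

In the script above, r.text() is called twice, and each call presumably captures the region and runs OCR again, which would explain why the printed value and the compared value can differ. Storing the result of a single r.text() call in a variable and passing that variable to both the print and parse_ocr_number avoids the second read.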

MP (jozzzzzz) said : #3

https://stackoverflow.com/questions/4944830/how-to-make-tesseract-to-recognize-only-numbers-when-they-are-mixed-with-letter

1) I found the settings that should solve my problem (I think).

I want Tesseract to recognize digits only.

But again I have trouble finding the right syntax in SikuliX.

tr.setoutputbase digits (not working)

2) Another solution is:

===============================
I made it a bit different (with tess-two). Maybe it will be useful for somebody.

So you need to initialize the API first.

TessBaseAPI baseApi = new TessBaseAPI();
baseApi.init(datapath, language, ocrEngineMode);

Then set the following variables

baseApi.setPageSegMode(TessBaseAPI.PageSegMode.PSM_SINGLE_LINE);
baseApi.setVariable(TessBaseAPI.VAR_CHAR_BLACKLIST, "!?@#$%&*()<>_-+=/:;'\"ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz");
baseApi.setVariable(TessBaseAPI.VAR_CHAR_WHITELIST, ".,0123456789");
baseApi.setVariable("classify_bln_numeric_mode", "1");
====================================

Yet again I can't seem to get the syntax right. How do I pass the baseApi commands to Tesseract in SikuliX?

MP (jozzzzzz) said : #5

Hey RaiMan, as mentioned in the first post, I read the doc first.
For example, tr.setDPI(50) doesn't work. The doc does not say what the correct syntax is.

MP (jozzzzzz) said : #6

Maybe I should mention:

Of course, I tried the command
tr.setVariable(variableKey, variableValue) from your doc before opening this topic.

Can you show me how to do this, for example how to tell Tesseract to read only numbers?

Because the only one I get to work is tr.setPSM(number).

RaiMan (raimund-hocke) said : #7

I have no experience with this.

It is either
tr.setVariable()
... which sets a single option

or
tr.setConfigs()
... which IMHO accepts a list of config file names.

For both, you have to consult the Tesseract docs (with 1.1.4 it is Tesseract 3).
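
A minimal sketch of what the tr.setVariable() route might look like for digits-only recognition. It assumes setVariable takes two strings (key, value) as the doc quoted in this thread suggests. The option names are Tesseract variables: tessedit_char_whitelist is the standard character whitelist, and classify_bln_numeric_mode comes from the tess-two snippet quoted earlier. The helper apply_digits_only and the RecordingRecognizer class are hypothetical and exist only so the sketch runs outside SikuliX.

```python
# Tesseract options for digits-only recognition (values are plain strings).
DIGITS_ONLY_OPTIONS = [
    ("tessedit_char_whitelist", "0123456789"),
    ("classify_bln_numeric_mode", "1"),
]


def apply_digits_only(recognizer):
    """Set each option on any object that offers setVariable(key, value)."""
    for key, value in DIGITS_ONLY_OPTIONS:
        recognizer.setVariable(key, value)
    return recognizer


class RecordingRecognizer(object):
    """Hypothetical stand-in for the SikuliX recognizer; it only records calls."""

    def __init__(self):
        self.variables = {}

    def setVariable(self, key, value):
        self.variables[key] = value


demo = apply_digits_only(RecordingRecognizer())
print(sorted(demo.variables.items()))
```

In a SikuliX script the intended use would be apply_digits_only(tr) right after tr = TextOCR.start(). This has not been verified against SikuliX 1.1.4, so whether the options take effect there is an open question.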
