SikuliX

tessedit_char_whitelist

Asked by John Nilson on 2014-09-03

I'm trying to override the tessedit_char_whitelist Tesseract config parameter. I want to tell Tesseract to match only alphanumeric characters (no punctuation, etc.). The full parameter definition is:

tessedit_char_whitelist abcdefghijklmnopqrtsuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789
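
For anyone who would rather generate that line than type it by hand, here is a minimal Python sketch. It only builds the text of the config line; the helper name `whitelist_line` is made up for illustration, and the order of the characters in the whitelist should not matter.

```python
import string

# Letters and digits only; no punctuation or whitespace.
ALLOWED = string.ascii_lowercase + string.ascii_uppercase + string.digits


def whitelist_line(chars=ALLOWED):
    """Return one config line: the parameter name, a space, then the allowed characters."""
    return "tessedit_char_whitelist " + chars


if __name__ == "__main__":
    print(whitelist_line())
```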

The problem is that I don't know where to put this in the tessdata folder. Someone suggested using the "bazaar" pattern-matching file: in other words, create a file called "bazaar_test" in the config directory and put that line in it.

First I tried it with a standalone install of Tesseract and it actually worked! Then I tried it in Sikuli's tessdata directory but didn't have any luck.
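
The standalone attempt can be sketched roughly as follows. This is an illustration under assumptions, not a tested recipe: it assumes a Tesseract 3.x install where config files live in a "configs" folder under tessdata and are named on the command line after the output base. The names `write_config`, `tesseract_command`, screenshot.png and result are made up for the example.

```python
import os
import tempfile

WHITELIST_LINE = ("tessedit_char_whitelist "
                  "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789")


def write_config(tessdata_dir, name="bazaar_test"):
    """Write the whitelist line to <tessdata_dir>/configs/<name> and return the path."""
    configs_dir = os.path.join(tessdata_dir, "configs")  # assumed layout
    if not os.path.isdir(configs_dir):
        os.makedirs(configs_dir)
    path = os.path.join(configs_dir, name)
    with open(path, "w") as handle:
        handle.write(WHITELIST_LINE + "\n")
    return path


def tesseract_command(image, output_base, config_name="bazaar_test"):
    """Build the argument list; the config file name goes after the output base."""
    return ["tesseract", image, output_base, config_name]


if __name__ == "__main__":
    demo_dir = tempfile.mkdtemp()  # stand-in for a real tessdata directory
    print(write_config(demo_dir))
    print(tesseract_command("screenshot.png", "result"))
```

The sketch only prepares the file and the command line. Sikuli does not go through the command-line tool, which may be why the same file had no effect there.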

Has anyone ever tried to do this before?

Question information

Language: English
Status: Solved
For: SikuliX
Assignee: No assignee
Solved by: John Nilson
Solved: 2014-09-04
Last query: 2014-09-04
Last reply: 2014-09-04

RaiMan (raimund-hocke) said on 2014-09-04:

I cannot help, since I do not have any special Tesseract knowledge yet (maybe later this year, when I re-implement the text features using Tess4J).

RaiMan (raimund-hocke) said on 2014-09-04:

Sorry, that was not an answer, only a comment.

John Nilson (jnnilson) said on 2014-09-04:

I was able to solve this problem by taking the following steps:

1) Downloaded Tesseract to get the utilities needed to unpack the eng.traineddata file.
2) Added the Tesseract directory to my executable path.
3) Changed to the Sikuli\libs\tessdata directory.
4) Copied the eng.* files into a new "Unpacked" directory I created, then ran the unpack command:
C:\Program Files (x86)\Sikuli\libs\tessdata\Unpacked>combine_tessdata -u eng.traineddata ./eng.
Extracting tessdata components from eng.traineddata
Wrote ./eng.config
Wrote ./eng.unicharset
Wrote ./eng.unicharambigs
Wrote ./eng.inttemp
Wrote ./eng.pffmtable
Wrote ./eng.normproto
Wrote ./eng.punc-dawg
Wrote ./eng.word-dawg
Wrote ./eng.number-dawg
Wrote ./eng.freq-dawg
Wrote ./eng.cube-unicharset
Wrote ./eng.cube-word-dawg
Wrote ./eng.shapetable
Wrote ./eng.bigram-dawg

5) Edited eng.config and added the line:
tessedit_char_whitelist abcdefghijklmnopqrtsuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789

6) Created a new eng.traineddata file using the following command:
C:\Program Files (x86)\Sikuli\libs\tessdata\Unpacked>combine_tessdata eng.
Combining tessdata files
TessdataManager combined tesseract data files.
Offset for type 0 is 140
Offset for type 1 is 358
Offset for type 2 is 7643
Offset for type 3 is 8690
Offset for type 4 is 980283
Offset for type 5 is 981099
Offset for type 6 is 997382
Offset for type 7 is 1001704
Offset for type 8 is 2085898
Offset for type 9 is 2112548
Offset for type 10 is -1
Offset for type 11 is 2113958
Offset for type 12 is 2115469
Offset for type 13 is 3177575
Offset for type 14 is 3240921
Offset for type 15 is -1
Offset for type 16 is -1

7) Copied the eng.traineddata I had just created in the Unpacked directory over the existing eng.traineddata.

8) Started the Sikuli IDE and voilà, it now reads only alphanumeric characters.
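
The steps above could be scripted along these lines. This is only a sketch under assumptions: combine_tessdata must be on the executable path (step 2), the unpack prefix is taken to be "eng." so that the unpacked files match the ones step 6 combines, and the function names and the .bak backup copy are my own additions, not part of the original procedure.

```python
import os
import shutil
import subprocess

WHITELIST_LINE = ("tessedit_char_whitelist "
                  "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789")


def add_whitelist(config_path, line=WHITELIST_LINE):
    """Append the whitelist line unless the file already sets one (step 5).

    Returns True if the file was changed.
    """
    existing = ""
    if os.path.exists(config_path):
        with open(config_path) as handle:
            existing = handle.read()
    for current in existing.splitlines():
        if current.strip().startswith("tessedit_char_whitelist"):
            return False
    with open(config_path, "a") as handle:
        if existing and not existing.endswith("\n"):
            handle.write("\n")
        handle.write(line + "\n")
    return True


def rebuild_traineddata(tessdata_dir, lang="eng"):
    """Unpack, edit and repack <lang>.traineddata inside tessdata_dir (steps 4 to 7)."""
    work_dir = os.path.join(tessdata_dir, "Unpacked")
    if not os.path.isdir(work_dir):
        os.makedirs(work_dir)
    packed = lang + ".traineddata"
    prefix = lang + "."
    original = os.path.join(tessdata_dir, packed)
    shutil.copy(original, original + ".bak")   # precaution, not in the original steps
    shutil.copy(original, work_dir)            # step 4: copy into Unpacked
    subprocess.check_call(["combine_tessdata", "-u", packed, prefix], cwd=work_dir)
    add_whitelist(os.path.join(work_dir, lang + ".config"))             # step 5
    subprocess.check_call(["combine_tessdata", prefix], cwd=work_dir)   # step 6
    shutil.copy(os.path.join(work_dir, packed), original)               # step 7
```

Calling rebuild_traineddata with the path to Sikuli's libs\tessdata directory would run steps 4 to 7 in one go; restart the Sikuli IDE afterwards (step 8).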

RaiMan (raimund-hocke) said on 2014-09-04:

Great, thanks for your contribution.

I guess in a few weeks I will understand what you have done ;-)
