tessedit_char_whitelist

Asked by John Nilson

I'm trying to override the tessedit_char_whitelist Tesseract config parameter. I want to tell Tesseract to match only alphanumeric characters (no punctuation etc.). The full parameter definition is:

tessedit_char_whitelist abcdefghijklmnopqrtsuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789
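
For reference, the line above can also be generated rather than typed by hand. This is only a minimal sketch in Python; the names `ALPHANUMERIC` and `whitelist_line` are made up for illustration and are not part of Tesseract or Sikuli.

```python
import string

# Letters and digits only; punctuation is deliberately left out.
ALPHANUMERIC = string.ascii_lowercase + string.ascii_uppercase + string.digits


def whitelist_line(chars=ALPHANUMERIC):
    """Return a config line: the parameter name, one space, then the allowed characters."""
    return "tessedit_char_whitelist " + chars


print(whitelist_line())
```

The generated line lists the letters in strict alphabetical order, while the line above has `rtsu`. Both contain the same 62 characters, and as far as I know the order makes no difference to a whitelist.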

The problem is that I don't know where to put this in the tessdata folder. Someone suggested using the "bazaar" pattern-matching file: in other words, define a file called "bazaar_test" in the config directory and put that line in it.
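
To make the config-file idea concrete, here is a rough sketch in Python of writing such a file. It assumes the config directory is a "configs" subdirectory of tessdata, which is how a standalone Tesseract 3.x install is usually laid out; the helper name `write_tesseract_config` is hypothetical.

```python
from pathlib import Path


def write_tesseract_config(tessdata_dir, name, lines):
    """Write a plain-text Tesseract config file and return its path.

    Assumption: config files live in a "configs" subdirectory of tessdata.
    """
    config_dir = Path(tessdata_dir) / "configs"
    config_dir.mkdir(parents=True, exist_ok=True)
    config_path = config_dir / name
    # One "parameter value" pair per line.
    config_path.write_text("\n".join(lines) + "\n", encoding="ascii")
    return config_path
```

With a standalone install, a config file like this is normally selected by passing its name as the last argument on the command line, for example `tesseract image.png out bazaar_test`. Whether Sikuli reads such a file at all is a separate matter.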

First I tried it with a standalone install of Tesseract and it actually worked! Then I tried it in Sikuli's tessdata directory but didn't have any luck.

Has anyone ever tried to do this before?

Question information

Language: English
Status: Solved
For: SikuliX
Assignee: No assignee
Solved by: John Nilson

RaiMan (raimund-hocke) said :
#1

I cannot help, since I do not have any special Tesseract knowledge yet (maybe later this year, when I re-implement the text features using Tess4J).

RaiMan (raimund-hocke) said :
#2

Sorry, that was not an answer, only a comment.

John Nilson (jnnilson) said :
#3

I was able to solve this problem by taking the following steps:

1) Downloaded Tesseract to get the utilities needed to unpack the eng.traineddata file.
2) Added the Tesseract directory to my executable path.
3) Changed to the Sikuli\libs\tessdata directory.
4) Copied the eng.* files into a new "Unpacked" directory I created, then ran the unpack command there:
C:\Program Files (x86)\Sikuli\libs\tessdata\Unpacked>combine_tessdata -u eng.traineddata ./eng2.
Extracting tessdata components from eng.traineddata
Wrote ./eng.config
Wrote ./eng.unicharset
Wrote ./eng2.unicharambigs
Wrote ./eng2.inttemp
Wrote ./eng.pffmtable
Wrote ./eng.normproto
Wrote ./eng.punc-dawg
Wrote ./eng.word-dawg
Wrote ./eng.number-dawg
Wrote ./eng.freq-dawg
Wrote ./eng.cube-unicharset
Wrote ./eng.cube-word-dawg
Wrote ./eng.shapetable
Wrote ./eng.bigram-dawg

5) Edited eng.config and added the line:
tessedit_char_whitelist abcdefghijklmnopqrtsuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789

6) Created a new eng.traineddata file using the following command:
C:\Program Files (x86)\Sikuli\libs\tessdata\Unpacked>combine_tessdata eng.
Combining tessdata files
TessdataManager combined tesseract data files.
Offset for type 0 is 140
Offset for type 1 is 358
Offset for type 2 is 7643
Offset for type 3 is 8690
Offset for type 4 is 980283
Offset for type 5 is 981099
Offset for type 6 is 997382
Offset for type 7 is 1001704
Offset for type 8 is 2085898
Offset for type 9 is 2112548
Offset for type 10 is -1
Offset for type 11 is 2113958
Offset for type 12 is 2115469
Offset for type 13 is 3177575
Offset for type 14 is 3240921
Offset for type 15 is -1
Offset for type 16 is -1

7) Overwrote the existing eng.traineddata in the tessdata directory with the eng.traineddata I had just created in the Unpacked directory.

8) Started the Sikuli IDE and voilà, text recognition now returns only alphanumeric characters.
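
Steps 4 to 7 could be scripted roughly as follows. This is only a sketch under several assumptions: it is meant for a standalone Python 3 (not the Sikuli IDE), `combine_tessdata` must already be on the executable path (step 2), and it unpacks with the prefix `./eng.` so that the component names match the eng.config edited in step 5 and the `combine_tessdata eng.` call in step 6. The function names are made up for illustration.

```python
import shutil
import subprocess
from pathlib import Path

WHITELIST = ("abcdefghijklmnopqrstuvwxyz"
             "ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789")


def unpack_cmd(lang="eng"):
    """Step 4: split <lang>.traineddata into components named <lang>.*"""
    return ["combine_tessdata", "-u", lang + ".traineddata", "./" + lang + "."]


def repack_cmd(lang="eng"):
    """Step 6: combine the <lang>.* components into <lang>.traineddata"""
    return ["combine_tessdata", lang + "."]


def add_whitelist(config_path, chars=WHITELIST):
    """Step 5: append the whitelist line unless the parameter is already set."""
    path = Path(config_path)
    text = path.read_text() if path.exists() else ""
    if "tessedit_char_whitelist" in text:
        return False
    if text and not text.endswith("\n"):
        text += "\n"
    path.write_text(text + "tessedit_char_whitelist " + chars + "\n")
    return True


def rebuild(tessdata_dir, lang="eng", run=subprocess.check_call):
    """Run steps 4 to 7 in an "Unpacked" working directory."""
    tessdata = Path(tessdata_dir)
    work = tessdata / "Unpacked"
    work.mkdir(exist_ok=True)
    packed = lang + ".traineddata"
    shutil.copy2(tessdata / packed, work / packed)
    run(unpack_cmd(lang), cwd=str(work))
    add_whitelist(work / (lang + ".config"))
    run(repack_cmd(lang), cwd=str(work))
    shutil.copy2(work / packed, tessdata / packed)  # overwrites the original
```

A call would look like `rebuild(r"C:\Program Files (x86)\Sikuli\libs\tessdata")`. Since the last step overwrites the installed eng.traineddata, it is worth keeping a backup copy of that file first.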

RaiMan (raimund-hocke) said :
#4

Great, thanks for your contribution.

I guess in a few weeks I will understand what you have done ;-)